Description

TECHNICAL FIELD 

[0001] The invention relates to maintaining synchronized
execution by processors in fault resilient/fault tolerant
computer systems.

BACKGROUND 

[0002] Computer systems that are capable of surviv- 
ing hardware failures or other faults generally fall into 
three categories: fault resilient, fault tolerant, and disas- 
ter tolerant. 

[0003] Fault resilient computer systems can continue to
function, often in a reduced capacity, in the presence of
hardware failures. These systems operate in either an
availability mode or an integrity mode, but not both. A system
is "available" when a hardware failure does not cause
unacceptable delays in user access, which means that a system
operating in an availability mode is configured to remain
online, if possible, when faced with a hardware error. A system
has data integrity when a hardware failure causes no data loss
or corruption, which means that a system operating in an
integrity mode is configured to avoid data loss or corruption,
even if the system must go offline to do so.

[0004] Fault tolerant systems stress both availability and
integrity. A fault tolerant system remains available and
retains data integrity when faced with a single hardware
failure, and, under some circumstances, when faced with
multiple hardware failures.

[0005] Disaster tolerant systems go beyond fault tolerant
systems. In general, disaster tolerant systems require that
loss of a computing site due to a natural or man-made disaster
will not interrupt system availability or corrupt or lose data.

[0006] All three cases require an alternative component that
continues to function in the presence of the failure of a
component. Thus, redundancy of components is a fundamental
prerequisite for a disaster tolerant, fault tolerant or fault
resilient system that recovers from or masks failures.
Redundancy can be provided through passive redundancy or active
redundancy, each of which has different consequences.
[0007] A passively redundant system, such as a
checkpoint-restart system, provides access to alternative
components that are not associated with the current task and
must be either activated or modified in some way to account for
a failed component. The consequent transition may cause a
significant interruption of service. Subsequent system
performance also may be degraded. Examples of passively
redundant systems include stand-by servers and clustered
systems. The mechanism for handling a failure in a passively
redundant system is to "fail-over", or switch control, to an
alternative server. The current state of the failed application
may be lost, and the application may need to be restarted in
the other system. The fail-over and restart processes may cause
some interruption or delay in service to the users. Despite any
such delay, passively redundant systems such as stand-by
servers and clusters provide "high availability" but do not
deliver the continuous processing usually associated with
"fault tolerance."
[0008] An actively redundant system, such as a replication
system, provides an alternative processor that concurrently
processes the same task and, in the presence of a failure,
provides continuous service. The mechanism for handling
failures is to compute through a failure on the remaining
processor. Because at least two processors are looking at and
manipulating the same data at the same time, the failure of any
single component should be invisible both to the application
and to the user.

[0009] The goal of a fault tolerant system is to produce
correct results in a repeatable fashion. Repeatability ensures
that operations may be resumed after a fault is detected. In a
checkpoint-restart system, this entails rolling back to a
previous checkpoint and replaying the inputs again from a
journal file. In a replication system, repeatability results
from simultaneous operation on multiple instances of a
computer.
[0010] Many fault tolerant designs are known for sin- 
gle processor systems. There also are a few known fault 
tolerant, symmetric multi-processing ("SMP") systems. 
The extra complexity associated with providing fault tol- 
erance in an SMP system causes problems for many 
traditional approaches to fault tolerance. 
[0011] For a checkpoint-restart system, the check- 
point information is somewhat more complex, but the 
recovery algorithm remains basically the same. Repeat- 
ability can be loosely interpreted to permit the replay of 
system operation to occur differently than the original 
system operation. In other words, the allocation of work- 
load between SMP processors on the replay does not 
have to follow the allocation that was being followed 
when the fault occurred. The order of the inputs must 
be preserved, but the relative timing of the inputs to each
other and to the instruction streams running on the dif- 
ferent processors does not need to be preserved. 
[0012] Under this loose repeatability standard, a re- 
play is valid as long as the results produced by the replay 
are proper for the sequence of inputs. An example is an 
airline reservation system with multiple customers (e.g.,
Mr. Smith and Ms. Jones) competing for the last seat. 
Due to input timing and processor scheduling, Ms. 
Jones gets the seat. However, before the result is post- 
ed, a fault occurs. On the replay, Mr. Smith gets the seat. 
Though producing a different result, the replay is valid 
since there is no cognizable problem associated with the 
change in result (i.e., Ms. Jones will never know she almost
got the seat).
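
The loose repeatability standard can be illustrated with a small Python sketch. The names `run` and `is_valid_result`, and the per-request latency figures, are hypothetical and chosen only for illustration; the sketch assumes that any outcome in which one of the requesting customers receives the seat is "proper" for the input sequence.

```python
def run(requests, latency):
    """Return the customer whose booking completes first.

    requests: customers in input order (this order is preserved on replay).
    latency:  hypothetical per-request processing delay standing in for
              processor scheduling, which may differ between runs.
    """
    finish = {name: i + latency[name] for i, name in enumerate(requests)}
    return min(requests, key=lambda name: finish[name])


def is_valid_result(requests, winner):
    """Loose repeatability: any one requesting customer is a proper result."""
    return winner in requests


requests = ["Smith", "Jones"]                       # input order is fixed
original = run(requests, {"Smith": 3, "Jones": 1})  # Jones completes first
replay = run(requests, {"Smith": 1, "Jones": 1})    # Smith completes first
```

Both runs see the same ordered inputs, yet the winner differs because the scheduling differs; each result passes `is_valid_result`, which is all that a checkpoint-restart replay needs.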

[0013] SMP adds considerable complexity to replication systems.
Corresponding processors in corresponding systems must produce
the same results at the same time. The input timing must be
precisely preserved with respect to the multiple instruction
streams. No difference between processor arbitration cycles is
allowed, because such a difference can affect who gets what
resource first. Making an SMP system with replication requires
control of all aspects of the system that can affect the timing
of input data and the arbitration between processors.

[0014] For these reasons, fault tolerant SMP systems
generally are produced using the checkpoint-restart ap- 
proach. In such systems, the application and operating 
system software must be specially designed to support 
checkpoints. 

[0015] The document EP-A-0 286 856 teaches a fault tolerant
symmetric multiprocessing system with strongly coupled compute
elements.

SUMMARY 

[0016] The invention, various aspects of which are de- 
scribed here below, is defined in detail in the appended 
claims 1 and 24. 

[0017] In one general aspect, a fault tolerant/fault resilient
computer system includes at least two compute elements
connected to at least one controller. Each of the compute
elements has clocks that operate asynchronously to clocks of
the other compute elements. The compute elements operate in a
first mode in which the compute elements each execute a first
stream of instructions in emulated clock lockstep. Clock
lockstep operation requires the compute elements to perform the
same sequence of instructions in the same order, with each
instruction being performed in the same clock cycle by each
compute element. The compute elements also operate in a second
mode in which the compute elements each execute a second stream
of instructions in instruction lockstep. Instruction lockstep
operation requires the compute elements to perform the same
sequence of instructions in the same order, but does not
require the compute elements to perform the instructions in the
same clock cycle.
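
The difference between the two lockstep definitions can be made concrete with a short Python sketch. It assumes, purely for illustration, that each compute element's execution is recorded as a list of `(clock_cycle, instruction)` pairs; the function names and the trace format are hypothetical.

```python
def instruction_lockstep(trace_a, trace_b):
    """Same instructions in the same order; clock cycles may differ."""
    return [op for _, op in trace_a] == [op for _, op in trace_b]


def clock_lockstep(trace_a, trace_b):
    """Same instructions in the same order AND in the same clock cycle."""
    return trace_a == trace_b


# Hypothetical traces: the second element stalls before "add",
# for example because of a cache miss.
ce1 = [(0, "load"), (1, "add"), (2, "store")]
ce2 = [(0, "load"), (3, "add"), (4, "store")]
```

Here `ce1` and `ce2` satisfy instruction lockstep but not clock lockstep, since the `add` and `store` instructions occur in different cycles.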

[0018] Implementations of the computer system may 
include one or more of the following features. For exam- 
ple, each compute element may be a multi-processor 
compute element, such as a symmetric multi-processor 
(SMP) compute element. Each compute element may 
be implemented using an industry standard mother- 
board. The system may be configured to deactivate all 
but one of the processors of each compute element 
when the compute elements are operating in the second 
mode. 

[0019] The first stream of instructions may implement operating
system and application software, while the second stream of
instructions implements lockstep control software. The
operating system and application software may be unmodified
software configured for use with computer systems that are not
fault tolerant.
[0020] Each compute element may include one or more processors,
memory, and a connection to the controller. The compute
elements may be configured so that refresh operations
associated with the memory are synchronized with execution of
operations by the processor. The system also may be configured
to initiate DMA transfers to the memory when the compute
elements are operating in the second mode and to execute the
initiated DMA transfers when the compute elements are operating
in the first mode.

[0021] The system may synchronize the compute elements by
copying contents of the memory of a first compute element to
the memory of a second compute element, and resetting the
processors of the first and second compute elements in a way
that does not affect the memories of the compute elements.

[0022] The compute elements may transition from the first mode
of operation to the second mode of operation in response to an
interrupt. For example, the interrupt may be a performance
counter interrupt generated by the compute element after the
occurrence of a fixed number of clock cycles, such as processor
clock cycles or bus clock cycles. The interrupt also may be
generated after the execution of a fixed number of
instructions. When the compute elements are multi-processor
compute elements having primary processors and one or more
secondary processors, the primary processor may be configured
to halt operation of the secondary processors in response to
the interrupt.
[0023] Each compute element may generate an interrupt during
the transition from the second mode of operation to the first
mode of operation. This interrupt serves to align the
processing by the compute element with a clocking structure of
the compute element. Typically, the interrupt is synchronized
with the clock having the lowest frequency of the clocking
structure.

[0024] The system may redirect I/O operations by the compute
elements to the controller. The system also may include a
second controller connected to the first controller and to the
two compute elements. The first controller and a first compute
element may be located in a first location and the second
controller and a second compute element may be located in a
second location, in which case the system also may include a
communications link connecting the first controller to the
second controller, the first controller to the second compute
element, and the second controller to the first compute
element. The first location may be spaced from the second
location by more than 5 meters, by more than 100 meters, or
even by a kilometer or more.
[0025] A benefit of creating a fault resilient/fault tolerant
SMP system using replication is that the system can run
standard application and operating system software, such as the
Windows NT operating system available from Microsoft
Corporation. In addition, the system can do so using
industry-standard processors and motherboards, such as
motherboards based on Pentium series processors available from
Intel Corporation.
[0026] Other features and advantages will be apparent from the
following description, including the drawings, and from the
claims.

DESCRIPTION OF DRAWINGS

[0027]

Figs. 1 and 2 are block diagrams of a fault resilient/fault
tolerant uni-processor computer system.
Fig. 3 is a block diagram of a fault resilient/fault tolerant
multi-processor computer system.
Fig. 4 is a block diagram of a motherboard.
Fig. 5 is a flow chart of a procedure implemented by the system
of Fig. 3.
Fig. 6 is a block diagram of a PCI interface.
Fig. 7 is a flow chart of a procedure implemented by the system
of Fig. 3.
Fig. 8 is a block diagram of a system having two
multi-processor compute elements and one I/O processor.
Figs. 9A and 9B are a flow chart of a procedure implemented by
the system of Fig. 8.

DETAILED DESCRIPTION 

[0028] The fault tolerant systems described below emulate
fully-phase-locked operation of multiple instances of a compute
element. This should be contrasted to prior systems that
operated multiple instances of a compute element in instruction
lockstep, such as the Endurance 4000 system available from
Marathon Technologies Corporation of Boxboro, Massachusetts.
Instruction lockstep operation occurs when multiple instances
of a compute element perform the same sequence of instructions
in the same order. Fully-phase-locked operation, which also may
be referred to as clock lockstep operation, occurs when
multiple instances of a compute element perform the same
sequence of instructions in the same order, with each
instruction being performed in the same clock cycle by each
instance of the compute element.

[0029] In the Endurance 4000 system, the instances of a compute
element operate in instruction stream lockstep. Each compute
element executes the same sequence of instructions prior to
producing an output. The time needed to execute the instruction
stream varies due to the uncontrolled past history of each
compute element. For example, caches, table lookahead buffers,
branch prediction logic, speculative execution logic, and
execution pipelines of the compute elements can have different
initial values, which, even though the instruction streams
being executed are the same, result in varying execution times.

[0030] Instruction lockstep operation may result in failures
when the compute elements are SMP servers. In such a system,
each compute element has multiple processors, each with its own
instruction stream. The instruction streams are arbitrating for
shared resources. This arbitration must be resolved identically
in both compute elements for redundant operation. Instruction
lockstep operation does not provide a tight enough control over
the processors and the memory to guarantee the same arbitration
resolution in both compute elements.

[0031] Clock lockstep operation may be achieved by using a
common oscillator to provide clocks to all instances of the
compute element. However, such an implementation may be
unsuited for fault tolerant operation because it includes a
single component, the common oscillator, the failure of which
will cause failure of the entire system.

[0032] Emulated clock lockstep operation avoids the single
point of failure and is achieved using the techniques described
below. Emulated clock lockstep operation offers the
considerable additional benefit of permitting the different
instances of a compute element to be separated by distances of
up to a kilometer or more.

[0033] An emulated-clock-lockstep, non-SMP, fault tolerant
system is described below. This description is followed by
description of a fault tolerant SMP system using replication
and emulated-clock-lockstep operation. In both systems, the
basic approach is to design a system in which multiple
instances of a compute element are initialized into exactly the
same state and then provided with exactly the same input
stimuli from a synchronous I/O subsystem. This causes each
instance to produce exactly the same result.
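
The basic approach (identical initial state plus identical input stimuli yields identical results) can be sketched in Python as a pair of deterministic state machines. The `ComputeElement` class and its arithmetic transition function are hypothetical stand-ins for a real compute element; they are assumptions made only to show the principle.

```python
class ComputeElement:
    """Toy deterministic machine: next state depends only on state and input."""

    def __init__(self, state=0):
        self.state = state

    def step(self, stimulus):
        # Arbitrary deterministic transition, chosen only for illustration.
        self.state = (self.state * 31 + stimulus) % 1_000_003
        return self.state


def run_replicas(initial_state, inputs, n=2):
    """Initialize n instances identically and feed each the same inputs."""
    replicas = [ComputeElement(initial_state) for _ in range(n)]
    return [[ce.step(x) for x in inputs] for ce in replicas]
```

Calling `run_replicas(7, [1, 2, 3])` returns one output list per instance, and the lists are equal because nothing other than the initial state and the inputs influences a transition.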
[0034] To progress a fault tolerant non-SMP (uni-processor)
implementation to a fault resilient/fault tolerant SMP
implementation, each processor is replaced by several
processors and an arbitration unit. Anytime that a processor
needs access to anything beyond its internal cache (e.g.,
memory or I/O), the processor uses the arbitration unit to
arbitrate for the external bus that connects the processors
together. Given that the arbitration units are finite state
engines initialized to the same state, they will follow the
same sequence of arbitrations as long as the processors are
functioning correctly.
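
The claim that identically initialized arbitration units follow the same sequence of arbitrations can be sketched as follows. The round-robin policy is an assumption made for illustration; the text does not state which arbitration policy the arbitration unit uses, and the class name is hypothetical.

```python
class RoundRobinArbiter:
    """Finite state engine whose only state is the last processor granted."""

    def __init__(self, n_processors, last_granted=-1):
        self.n = n_processors
        self.last = last_granted

    def grant(self, requests):
        """Grant the bus to the next requesting processor after the last one.

        requests: set of processor indices currently requesting the bus.
        Returns the granted index, or None if nobody is requesting.
        """
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if candidate in requests:
                self.last = candidate
                return candidate
        return None


# The same request pattern is presented to the arbiter in each compute element.
pattern = [{0, 1}, {0, 1}, {2}, set(), {0, 3}]
arbiter_a = RoundRobinArbiter(4)
arbiter_b = RoundRobinArbiter(4)
grants_a = [arbiter_a.grant(r) for r in pattern]
grants_b = [arbiter_b.grant(r) for r in pattern]
```

Because each grant depends only on the arbiter state and the request set, `grants_a` and `grants_b` are identical. If the request sets reached the two arbiters in different cycles, the grants could diverge, which is why request timing has to be controlled as well.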

Uni-Processor (Non-SMP) System

[0035] Fig. 1 illustrates a fault tolerant, non-SMP system 100
that emulates clock lockstep operation. In general, all
computer systems perform two basic operations: (1) manipulating
and transforming data, and (2) moving the data to and from mass
storage, networks, and other I/O devices. The system 100
divides these functions, both logically and physically, between
two separate processors. For this purpose, each half of the
system 100, called a tuple, includes a compute element ("CE")
105 and an I/O processor ("IOP") 110. The compute element 105
processes user application and operating system software. I/O
requests generated by the compute element 105 are redirected to
the I/O processor 110. This redirection is implemented at the
device driver level. The I/O processor 110 provides I/O
resources, including I/O processing, data storage, and network
connectivity. The I/O processor 110 also controls
synchronization of the compute elements.
[0036] The system 100 is fault tolerant in that it continues to
operate transparently to its users in the presence of any
single hardware failure. The system 100 emulates a traditional
computing environment by partitioning it into two components.
The compute element 105 handles all compute tasks for the
operating system and any applications. The I/O processor 110
handles all I/O devices. Thus, the I/O processor handles all of
the asynchronous activities associated with a computer, while
the compute element handles all of the synchronous compute
activities.

[0037] To provide the necessary redundancy for fault tolerance,
the system 100 includes at least two compute elements 105 and
at least two I/O processors 110. The two compute elements 105
operate in lockstep while the two I/O processors 110 are
loosely coupled. The I/O processors 110 feed both compute
elements 105 the exact same data at a controlled place in the
instruction streams of the compute elements. The I/O processors
verify that the compute elements generate the same I/O
operations and produce the same output data at the same time.
The I/O processors also cross check each other for proper
completion of requested I/O activity.
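
The verification performed by an I/O processor can be sketched as a comparison of the I/O request streams produced by the two compute elements. This is a minimal illustration only: the function name and the representation of a request as a tuple are hypothetical, and the sketch does not model timing or the cross check between I/O processors.

```python
def first_divergence(ce_a_requests, ce_b_requests):
    """Return the index of the first mismatching I/O request, or None.

    Each argument is the ordered list of I/O requests emitted by one
    compute element. A missing request counts as a mismatch.
    """
    for i, (a, b) in enumerate(zip(ce_a_requests, ce_b_requests)):
        if a != b:
            return i
    if len(ce_a_requests) != len(ce_b_requests):
        return min(len(ce_a_requests), len(ce_b_requests))
    return None


good = [("write", "disk", b"abc"), ("send", "net", b"hello")]
bad = [("write", "disk", b"abc"), ("send", "net", b"hellp")]
```

`first_divergence(good, good)` returns `None`, while `first_divergence(good, bad)` returns `1`, the position at which fault handling would need to begin.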
[0038] The system 100 uses a software-based approach in a
configuration based on inexpensive, industry standard
processors. For example, the compute elements 105 and I/O
processors 110 may be implemented
using Pentium Pro processors available from Intel
Corporation. The system may run unmodified, industry- 
standard operating system software, such as the Win- 
dows NT operating system available from Microsoft Cor- 
poration, as well as industry-standard applications soft- 
ware. This permits a fault tolerant system to be config- 
ured by combining off-the-shelf, Intel Pentium Pro- 
based servers from a variety of manufacturers, which 
results in a fault tolerant or disaster tolerant system with 
low acquisition and life cycle costs. 
[0039] Each compute element 105 includes a processor 115,
memory 120, and an interface card 125 (also referred to as a
Marathon interface card, or MIC). The interface card 125
includes drivers for communicating with two I/O processors
simultaneously, as well as comparison and test logic that
assures results received from the two I/O processors are
identical. In the fault tolerant system 100, the interface card
125 of each compute element 105 is connected by high speed
links 130, such as fiber optic links, to interface cards 125 of
the two I/O processors 110. The interface cards 125 may be
implemented as PCI-based adapters.

[0040] Each I/O processor 110 includes a processor 115, memory
120, an interface card 125, and I/O adapters 135 for connection
to I/O devices such as a hard drive 140 and a network 145. As
noted above, the interface card 125 of each I/O processor 110
is connected by high speed links 130 to the interface cards 125
of the two compute elements 105. In addition, a high speed link
150, such as a private ethernet link, is provided between the
two I/O processors 110.
[0041] All I/O task requests from the compute elements 105 are
redirected to the I/O processors 110 for handling. The I/O
processor 110 runs specialized software that handles all of the
fault handling, disk mirroring, system management, and
resynchronization tasks required by the system 100. By using a
multitasking operating system, such as Windows NT, the I/O
processor 110 may run other, non-fault tolerant applications.
In general, a compute element may run Windows NT Server as an
operating system while, depending on the way that the I/O
processor is to be used, an I/O processor may run either
Windows NT Server or Windows NT Workstation as an operating
system.
[0042] The two compute elements 105 run lockstep control
software, also referred to as quantum synchronization software,
and execute the operating system and the applications in
emulated clock lockstep. Disk mirroring takes place by
duplicating writes on the disks 140 associated with each I/O
processor. If one of the compute elements 105 should fail, the
other compute element 105 keeps the system running with a pause
of only a few milliseconds to remove the failed compute element
105 from the configuration. The failed compute element 105 then
can be physically removed, repaired, reconnected, and turned
on. The repaired compute element then is brought back
automatically into the configuration by transferring the state
of the running compute element to the repaired compute element
over the high speed links and resynchronizing. The states of
the operating system and applications are maintained through
the few seconds it takes to resynchronize the two compute
elements so as to minimize any impact on system users.

[0043] If an I/O processor 110 fails, the other I/O processor
110 continues to keep the system running. The failed I/O
processor then can be physically removed, repaired and turned
back on. Since the I/O processors are not running in lockstep,
the repaired system may go through a full operating system
reboot, and then may be resynchronized. After being
resynchronized, the repaired I/O processor automatically
rejoins the configuration and the mirrored disks are
re-mirrored in background mode over the private connection 150
between the I/O processors. A failure of one of the mirrored
disks is handled through the same process.
[0044] The connections to the network 145 also are fully
redundant. Network connections from each I/O processor 110 are
booted with the same address. Only one network connection is
allowed to transmit messages, while both are allowed to receive
messages. In this way, each network connection monitors the
other through the private ethernet. Should either network
connection fail, the I/O processors will detect the failure and
the remaining connection will carry the load. The I/O
processors notify the system manager in the event of a failure
so that a repair can be initiated.
[0045] While Fig. 1 shows both connections on a single network
segment, this is not a requirement. Each I/O processor's
network connection may be on a different segment of the same
network. The system also accommodates multiple networks, each
with its own redundant connections. The extension of the system
to disaster tolerance requires only that the connection between
the tuples be optical fiber or a connection having compatible
speed. With such connections, the tuples may be spaced by
distances of a kilometer or more. Since the compute elements
are synchronized over this distance, the failure of a component
or a site will be transparent to the users.

[0046] Fig. 2 provides a summarized view of the system 100 of
Fig. 1. The system includes redundant compute elements 105
("CEs") and I/O processors 110 ("IOPs"). Each CE 105 is
responsible for all computing and may be implemented using an
industry standard motherboard. Each IOP 110 is responsible for
access to I/O devices, and for system control. The IOPs run
asynchronously of each other and verify that the CEs are
performing the same operations in the same order. The IOPs also
track each other's I/O completion to ensure that no I/O is
lost.

[0047] The CEs generate the same outputs in the exact same
sequence, and run in emulated clock lockstep, even though the
CE clocks are asynchronous to each other. The CEs are
initialized to the same state and are fed consistent inputs at
exactly the same time. The CEs are periodically realigned using
a self-generated interrupt that is related to the occurrence of
a quantum of clock cycles (e.g., 100,000 clock cycles) and is
referred to as a quantum interrupt ("QI"). By contrast, the
prior Endurance 4000 system used QIs related to the completion
of a quantum of instructions. All inputs to the CEs are
delivered at either an output window or after the completion of
an instruction quantum. Both of these points are guaranteed to
occur at the same point in the instruction streams of the CEs.
The approach employed by the Endurance 4000 system is described
in U.S. Patent Nos. 5,600,784 and 5,615,403.
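
The idea of delivering inputs only at quantum boundaries can be sketched in Python. The sketch assumes, hypothetically, that the I/O processors have already assigned each input to a quantum number; how that assignment is made is not described here, and the `run_ce` function and `schedule` mapping are illustrative names only.

```python
def run_ce(schedule, n_quanta, cycles_per_quantum=100_000):
    """Simulate one compute element accepting inputs at quantum interrupts.

    schedule: maps a quantum number (1-based) to the list of inputs that
              are to be delivered at the end of that quantum.
    Returns the inputs as seen by the CE, tagged with its own cycle count.
    """
    seen = []
    cycle = 0
    for quantum in range(1, n_quanta + 1):
        cycle += cycles_per_quantum             # normal processing
        for item in schedule.get(quantum, []):  # quantum interrupt
            seen.append((cycle, item))
    return seen


schedule = {1: ["pkt0"], 3: ["disk1", "pkt2"]}
ce_a_view = run_ce(schedule, n_quanta=3)
ce_b_view = run_ce(schedule, n_quanta=3)
```

Each CE observes every input at the same point in its own cycle count, regardless of when the input physically arrived, so `ce_a_view` and `ce_b_view` are equal even if the two elements run at slightly different wall-clock rates.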

Multi-Processor (SMP) System 

[0048] Fig. 3 illustrates a fault resilient/fault tolerant,
symmetric multi-processing ("SMP") system 300. Each CE 305 of
the system 300 includes a collection of processors 310
connected by a common processor bus 315 and an arbitration unit
320. The processors use the bus 315 and arbitration unit 320 to
access a shared memory 325, and to access two IOPs 330 through
an interface card 335 and high speed data links 340.
[0049] The IOPs 330 operate identically to the IOPs 110 of the
system 100. Thus, the IOPs handle all I/O task requests from
the processors 310 and run specialized software that handles
all of the fault handling, disk mirroring, system management,
and resynchronization tasks required by the system 300.
[0050] One processor 310 (identified as processor 310a) of each
CE 305 serves as a primary processor and runs lockstep control
software in addition to executing an operating system and
applications in emulated clock lockstep with the other CE. The
remaining processors in each CE 305 execute the operating
system and applications in emulated clock lockstep with the
other CE.

[0051] Referring to Fig. 4, a motherboard 400 for use in a CE
305 of the system 300 includes two or more processors 310. Each
processor may operate at a clock speed of, for example, 300 MHz
or 350 MHz. The processors 310 are interconnected and connected
to the arbitration unit 320 by the bus 315, which is also
referred to as the processor bus or the front side bus ("FSB").
The FSB typically operates at a clock speed of 100 MHz. The
arbitration unit 320 is commonly referred to as the North
Bridge, since it serves as a bridge from the processor bus 315
to the memory 325 and to the PCI bus 705. The PCI bus 705
typically is a 32 bit bus operating at 33 MHz or a 64 bit bus
operating at 66 MHz. The interface card 335 is implemented as a
PCI device connected to the PCI bus 705.

[0052] The PCI bus 705 is also connected to another component,
which is commonly referred to as the South Bridge 710. The
South Bridge includes an advanced peripheral interrupt
controller ("APIC") 715 that provides interrupts to the
processors 310 on an APIC bus 720. The processors 310 include
their own APICs 725 that receive the interrupts. The APIC bus
may be, for example, a 16.6 MHz bus.

[0053] The motherboard 700 may be implemented using an industry
standard motherboard. In this case, the motherboard 700 also
may include a number of components that, though standard on the
motherboard, are not used by the system 300. These components
include a video card 730 connected to the North Bridge 320 by
an AGP bus 735 (or by the PCI bus); one or more SCSI
controllers 740 connected to the PCI bus 705; one or more PCI
devices 745 connected to the PCI bus 705; an IDE drive
controller 750 connected to the South Bridge 710; an ISA (16
bit, 8 MHz) or EISA (32 bit, 8 MHz) bus 755 connected to the
South Bridge 710; one or more ISA or EISA devices 760 connected
to the bus 755; and a super I/O controller 765 connected to the
bus 755 to provide keyboard, mouse, and floppy drive support,
as well as parallel and serial ports. These components, if
present, are not used by the CE 305.
[0054] Marathon's prior Endurance 4000 system provided a fault
tolerant structure in which processors were kept in lockstep
while disregarding time skew. In essence, the time difference
between processors was not important, assuming asynchrony
between processors did not affect instruction lockstep. Memory
refresh and DMA interactions, which had no impact on the
lockstep of the processors, did affect the timing asynchrony.
Video processing had both a timing and an instruction
component. Care was taken to ensure that video and quantum
processing created neither instruction nor data divergence.

[0055] When progressing from a uni-processor design to an SMP
design, the addition of one or more processors in each CE
impacts both timing and instruction execution. The multiple
processors interact with each other directly and indirectly.
The direct interaction is through SMP features provided by the
processors, such as the HALT instruction and interprocessor
interrupts provided by the Intel Pentium Pro processor. The
indirect interaction is through formal and informal semaphore
mechanisms. Provided that the clock structure and processor
state are sufficiently coordinated, these semaphores align
themselves.

[0056] Referring again to Fig. 3, the system 300, like 
the system 100, achieves fault tolerance by clock phase
lockstep operation by the two CEs. Given two CEs in 
clock phase lockstep and synchronous control of all in- 
puts to the CEs, the CEs will execute exactly the same 
instruction stream at precisely the same time. This mod- 
el avoids any need to understand and control all opera- 
tions that are used by applications when dealing with an 
SMP system. 

[0057] As previously noted, the effect of clock phase 
lockstepped CEs may be produced without actually 
locking the clocks of the CEs together. Clock phase lock- 
step guarantees that every operation of a CE is started 
with exactly the same clock alignment in each CE with- 
out ever having to do anything to maintain that alignment 
once the initial lock is established. The effect of CE clock 
alignment can be produced using asynchronous CEs 
provided that a realignment is done whenever an oper- 
ation that could cause misalignment of the CEs occurs. 
Thus, to achieve the effect of CE clock alignment, the 
system 300 controls the CEs to behave like automata,
synchronizes the CEs to the same initial conditions, pre- 
vents divergence of the CEs due to asynchronous be- 
havior, and periodically realigns the CEs. 
[0058] The CEs generally operate in two modes. The
first mode is used for normal processing of applications 
and operating system software. In this mode, the CEs 
operate in emulated clock lockstep. The second mode 
is used during realignment of the CEs and other system- 
level operations. In this mode, the CEs operate in in- 
struction lockstep. 

1. CE Automata Behavior

[0059] It is relatively easy to constrain the two CE 
motherboards to behave as automata. All that is needed
is to disable all devices that generate non-reproducible
events, such as real time clocks, and emulate them using
software.

2. CE Initialization/Synchronization 

[0060] In a conventional, fault intolerant SMP system,
all processors become active simultaneously. Thereafter,
at some point in the initialization process, all
processors other than the primary processor are
deactivated. Once the primary processor has set up the
system, the other processors are activated and normal SMP
activity begins.

[0061] For fault tolerance, the activation process is
carefully crafted, since it dictates the relative timing
between the processors. In particular, the system alignment
is adjusted to a known state. This requires memory refresh,
PCI clocking, interrupt clocking, interrupt arbitration,
CPU arbitration, I/O interactions, and all CPU caches to be
in a known state.

[0062] The CPUs are started with exactly the same state
information. This is achieved through a synchronization
process. In particular, the memory contents of the running
CE are copied over to the synchronizing CE. Once the memory
contents are copied, the processor state is transferred.
Both processors then execute a power fail recovery type
sequence and restore their context from the memory image.

[0063] The uni-processor system is not sensitive to cache,
branch prediction, and translation buffer contents. An SMP
system will be sensitive to these. One technique for
initializing these subsystems is to initiate a full
processor reset, which may require custom BIOS to restrict
the restart time. Another technique is to execute an
algorithm that forces known values into these subsystems.

[0064] Referring to Fig. 5, CE initialization/synchronization
is performed according to a procedure 500. First, one CE is
loaded with operating system software, application software,
and system control software (step 505). This CE is referred
to as the active CE, while the other CE is referred to as
the synchronizing CE. Activation of the active CE typically
includes deactivation of all processors but the primary
processor (step 510).

[0065] Next, the internal state of the active CE is saved in
the memory 325 (step 515). When the other processors have
been deactivated, the internal state includes just the
internal state of the primary processor. Any internal
values stored in the arbitration unit 320 and the interface
card 335 also may be saved in the memory. The contents of
the active CE's memory 325 then are copied to the
synchronizing CE (step 520).
[0066] After the state of the active CE is transferred, both
CEs execute a reset procedure (step 525). The reset
procedure clears the internal state of each processor 310,
including all caches, but leaves intact the memory 325,
which contains the saved state of the active CE.
[0067] After executing the reset procedure, the CEs wait for
a software interrupt (step 530). The interrupt is delivered
to both CEs simultaneously. Upon receiving the interrupt
(step 535), the primary processor of each CE loads the
stored state from memory (step 540) and begins emulated
clock lockstep operation (step 545). This operation may
include activation of the other processors of the CE (step
550).
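
The ordering of steps 515 through 540 can be illustrated with a minimal Python sketch. The `CE` class, its dictionaries, and the `synchronize` function are hypothetical names introduced only for illustration; an actual CE operates on hardware state rather than dictionaries, and the simultaneous interrupt of step 530 is modeled here simply as restoring both CEs in turn.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class CE:
    """Toy compute element (hypothetical model, not the real hardware)."""
    cpu_state: dict = field(default_factory=dict)  # volatile processor state
    memory: dict = field(default_factory=dict)     # survives a processor reset

    def save_state(self):
        # Step 515: save the internal state into memory.
        self.memory["saved_state"] = copy.deepcopy(self.cpu_state)

    def reset(self):
        # Step 525: clear processor state but leave memory intact.
        self.cpu_state = {}

    def restore(self):
        # Step 540: reload the stored state from memory.
        self.cpu_state = copy.deepcopy(self.memory["saved_state"])


def synchronize(active, syncing):
    active.save_state()                            # step 515
    syncing.memory = copy.deepcopy(active.memory)  # step 520
    for ce in (active, syncing):                   # step 525
        ce.reset()
    for ce in (active, syncing):                   # steps 530-540
        ce.restore()


active = CE(cpu_state={"pc": 0x1000, "eax": 7}, memory={"app": "data"})
syncing = CE()
synchronize(active, syncing)
```

After `synchronize` returns, both objects hold the same processor state and the same memory image, which is the precondition for emulated clock lockstep operation.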

[0068] In general, the mechanisms described above require
all motherboard clocks to have a common base frequency, and
further require that the clocks can be phase aligned under
software control. In addition, the motherboard must be
capable of clearing all processor states under software
control without a full motherboard reset, which would also
clear the memory 325. This may be achieved through a
hardware reset mechanism that permits the processor to be
reset without resetting the I/O devices and the memory. A
relatively more difficult way of achieving this is to
modify the BIOS to allow reestablishment of connections to
memory and I/O after a full motherboard reset. This would
require the processor to snapshot all tables and other
necessary parameters prior to performing the reset.

3. Controlling the Divergence of CE Stimuli due to 
Asynchronous Behavior 

[0069] There are two fundamental sources of asynchronous
behavior: asynchronous clocks and non-synchronized
events. Asynchronous clocks are found in
video controllers, real time clocks, and I/O devices. They 
are inherently imprecise mechanisms that cannot be tol- 
erated between replicated fault resilient/fault tolerant 
computers. 

[0070] I/O requests by the CEs are intercepted and 
handled by the IOPs 330, and quantum interrupts are
used to periodically update the real time clocks of the 
CEs. Quantum interrupts are interrupts generated after 
execution of a fixed number of clock cycles. 
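
As a rough illustration of why a clock driven by quantum interrupts is reproducible, the following Python sketch advances a software clock only when a quantum interrupt arrives. The `EmulatedClock` class and the idea of deriving elapsed time purely from the interrupt count are assumptions made for this example; the text above states only that quantum interrupts are used to update the real time clocks.

```python
from fractions import Fraction


class EmulatedClock:
    """Software clock advanced only by quantum interrupts (illustrative)."""

    def __init__(self, cycles_per_quantum, clock_hz):
        self.cycles_per_quantum = cycles_per_quantum
        self.clock_hz = clock_hz
        self.quanta = 0

    def on_quantum_interrupt(self):
        # Called once per quantum, i.e. after a fixed number of clock cycles.
        self.quanta += 1

    def seconds(self):
        # Exact elapsed time implied by the number of quanta seen so far.
        return Fraction(self.quanta * self.cycles_per_quantum, self.clock_hz)


# Two CEs that receive the same number of quantum interrupts agree exactly.
ce_a = EmulatedClock(cycles_per_quantum=100_000, clock_hz=66_000_000)
ce_b = EmulatedClock(cycles_per_quantum=100_000, clock_hz=66_000_000)
for _ in range(3):
    ce_a.on_quantum_interrupt()
    ce_b.on_quantum_interrupt()
```

Because the value depends only on how many quanta have elapsed, two CEs executing the same instruction stream read the same time, unlike a free-running hardware clock.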
[0071] The SMP system 300 provides for totally synchronous
I/O. This means that all accesses to the interface
card 335 connecting a CE 305 to the IOPs 330 must
occur in a reproducible manner. This requires guaran- 
teed timing as viewed from the PCI bus, DMA that is 
aligned to some CPU controllable event, data availabil- 
ity that is synchronous to the CPU instruction stream 
across all instances, and restricted use of polling activ- 
ity. 

[0072] The input and output to the CEs are controlled 
through software and a PCI interface module of the
interface card 335. Fig. 6 illustrates the PCI interface
module 600. The module 600 includes a PCI section 605
that operates synchronously to the arbitration unit 320
and the processors 310 of the CE 305. The module 600
also includes a receiver section 610 and a transmitter
section 615 that operate synchronously with the IOP
330 to which they are connected. A reception memory
620 and a transmission memory 625 act as the interface 
between the sections. Each memory functions as a du- 
al-ported memory. 

[0073] The transfer from the asynchronous timing of the IOP
to the CE clocking is done through the use of a hardware
protocol (referred to as a freeze protocol) implemented by
the interface card 335. In an SMP implementation, any
polling activity with the interface card 335 contributes to
asymmetric operation between the CEs 310. The freeze
protocol ensures that data will not be transferred from one
of the interface memories until all of the data is in the
memory of each CE.
[0074] When implementing the freeze protocol, the CEs, which
are processing an identical instruction stream, each stop
processing of the instruction stream at a common point in
the instruction stream. Each CE then generates a freeze
request message and transmits the freeze request message to
the IOPs. An IOP receives a freeze request message from a
CE, waits for a freeze request message from other CEs, and,
upon receiving a freeze request message from each CE
processing an identical instruction stream, generates a
freeze response message and transmits the freeze response
message to the CEs. Each CE, upon receiving a freeze
response message from an IOP, waits for freeze response
messages from other IOPs to which a freeze request message
was transmitted, and, upon receiving a freeze response
message from each IOP, generates a freeze release message,
transmits the freeze release message to the IOPs, and
resumes processing of the instruction stream (and
transmission of data from the reception memory). The
interface card 335 and the freeze protocol are configured
so as to avoid disturbing the caches, TLB and BTB in the
Intel Pentium Pro processor. The freeze protocol is
discussed in more detail in U.S. Patent No. 5,790,197,
titled "FAULT HANDLING."
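
The request, response, and release exchange described above can be sketched as a small Python model. The class names, the `run_freeze` function, and the `arrival_order` parameter are hypothetical; message delivery is modeled as direct function calls, whereas the actual protocol is implemented in hardware by the interface card 335.

```python
class IOP:
    def __init__(self, name, ce_names):
        self.name = name
        self.expected = set(ce_names)
        self.requests = set()

    def on_freeze_request(self, ce_name):
        """Record a request; True once every CE has sent one."""
        self.requests.add(ce_name)
        return self.requests == self.expected


class CE:
    def __init__(self, name, iop_names):
        self.name = name
        self.expected = set(iop_names)
        self.responses = set()
        self.running = True

    def freeze(self):
        # Stop at the common point in the instruction stream.
        self.running = False

    def on_freeze_response(self, iop_name):
        """Record a response; resume once every IOP has answered."""
        self.responses.add(iop_name)
        if self.responses == self.expected:
            self.running = True
        return self.running


def run_freeze(ce_names, iop_names, arrival_order):
    """arrival_order lists (ce, iop) pairs in the order requests arrive."""
    ces = {n: CE(n, iop_names) for n in ce_names}
    iops = {n: IOP(n, ce_names) for n in iop_names}
    log = []
    for ce in ces.values():
        ce.freeze()
    for ce_name, iop_name in arrival_order:
        log.append(("request", ce_name, iop_name))
        if iops[iop_name].on_freeze_request(ce_name):
            # The IOP has heard from every CE: send a response to each CE.
            for ce in ces.values():
                log.append(("response", iop_name, ce.name))
                if ce.on_freeze_response(iop_name):
                    log.append(("release", ce.name))
    return ces, log


order = [("ce0", "iop0"), ("ce0", "iop1"), ("ce1", "iop1"), ("ce1", "iop0")]
ces, log = run_freeze(["ce0", "ce1"], ["iop0", "iop1"], order)
partial, _ = run_freeze(["ce0", "ce1"], ["iop0", "iop1"], order[:3])
```

In this model no CE resumes until every IOP has received a request from every CE, regardless of the order in which the requests arrive, which is the property that keeps the CEs from consuming input at different points in the instruction stream.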
[0075] Video operations exhibit asynchronous behav- 
ior due to the oscillator on the video card, which has no 
correlation with the CPU clock. Video controls are de- 
rived from this oscillator. Additionally, the video drivers 
execute code that is dependent on polling I/O registers. 
Techniques for eliminating asynchronies associated 
with video include creating a video module with guaranteed
timing, re-directing the video like other I/O, and creating
a virtual video module that isolates the asynchronous
timing of the actual video module.
[0076] Non-synchronized events may occur whenev- 
er different clock rates are derived from a common os- 
cillator. For example, a 66 MHz processor clock may be 
divided down to derive a 33 MHz PCI clock. In this case,
since the processor clock is twice as fast as the PCI
clock, every second processor clock cycle aligns with a
PCI cycle. Similarly, memory refresh is triggered
approximately every 15 microseconds by dividing the
processor clock by 1000, which means that the processor
clock will align with the memory once every 1000 proc- 
essor clock cycles. To guarantee reproducibility, each 
CE must be started with the same alignment of these 
normally non-synchronized elements. 
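
The arithmetic in this example can be checked with a short Python sketch. The constants come from the paragraph above; the `edges` helper and the notion of a per-board `phase` offset are illustrative assumptions used to show how two boards with different divider alignment see the same derived clock on different processor cycles.

```python
from math import lcm

CPU_HZ = 66_000_000   # processor clock from the example above
PCI_DIV = 2           # 66 MHz / 2 = 33 MHz PCI clock
REFRESH_DIV = 1000    # refresh request every 1000 processor cycles

# Refresh period in microseconds (roughly 15, as stated in the text).
refresh_period_us = REFRESH_DIV / CPU_HZ * 1_000_000

# Number of processor cycles after which both dividers line up again.
alignment_period = lcm(PCI_DIV, REFRESH_DIV)


def edges(divider, phase, cycles):
    """Processor cycles in [0, cycles) on which a divided clock ticks."""
    return [c for c in range(cycles) if (c - phase) % divider == 0]


# Two boards whose refresh dividers started with different alignment.
board_a = edges(REFRESH_DIV, 0, 2001)
board_b = edges(REFRESH_DIV, 3, 2001)
```

The two boards run at exactly the same frequency, yet refresh lands on different processor cycles, which is why each CE must be started with the same alignment.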
[0077] Interrupts also must be configured to be synchronous
with the processor. This is done by implementing the
motherboard so that interrupt clocking is synchronous to
the PCI clock, which, as noted above, is aligned with the
CPU clock.

[0078] In general, Marathon's Endurance 4000 system used
only one interrupt, which was produced based on instruction
stream execution. The performance counter was tied in
through the local APIC to produce an interrupt after a
given number of instructions had been executed. The
generation of the interrupt was synchronous to the
instruction stream, but the delivery was not guaranteed to
be synchronous. Part of the uncertainty resulted from the
choice of an APIC clock, which can be made synchronous to
the CPU clock. To remove this uncertainty in the system
300, the clocking structure of the local APIC and its
interface to the processor are used to retime the interrupt
such that all uncertainty in its delivery is removed.

[0079] In most industry standard motherboards,
memory refresh is synchronous to the CPU clock but it 
is asynchronous to the instruction stream. For these 
motherboards, there is no direct correlation between 
CPU execution and memory refresh. Memory refresh al- 
ters the timing of memory access between systems un- 
less it is started off with the same alignment on both sys- 
tems. For this reason, memory refresh also is controlled. 
This may be achieved through a chip set modification 
that establishes an I/O location that produces refresh 
activity in response to a read so as to permit refresh 
alignment to be forced from software. As an alternative, 
alignment can be inferred from refresh activity and made 
visible to the CPU. Yet another alternative makes re- 
fresh occur a fixed number of cycles after refresh is en- 
abled. 

[0080] Unlike the uni-processor system, the multi-processor
system must control bus arbitration. The uni-processor
system is able to ignore the effects of bus arbitration by
aligning the instruction streams. Bus arbitration control
is unnecessary because there is no combination of
arbitration between a single processor, memory, and I/O
that will produce a different result for a well-behaved
program, since such a program will not allow shared access
to memory that is being written by another entity.

[0081] This simple rule does not hold for multi-processor
systems. There are algorithms that use two or more
processors and permit all processors to read or write a
single location for the purpose of loosely tracking system
metrics. Other algorithms permit multiple processors to
read and modify memory locations to gain exclusive access
to a larger data structure without using bus lock
structures. Thus, any change in the arbitration order
between processors may have a dramatic effect on the
contents of memory, which violates the reproducibility
constraint. Accordingly, a multi-processor system must
effectively control bus arbitration.
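
The sensitivity to arbitration order can be demonstrated with a toy Python model in which two processors each perform an unlocked read-modify-write on one shared location. The `run` function and the schedule encoding are hypothetical; each schedule entry stands for one bus grant to the named processor.

```python
def run(schedule):
    """Apply one unlocked increment per processor under a given bus order.

    schedule: sequence of processor ids; each entry grants one bus access.
    A processor's first grant reads the counter, its second writes back
    the value it read plus one.
    """
    memory = {"counter": 0}
    held = {0: None, 1: None}  # value each processor read, if any
    for cpu in schedule:
        if held[cpu] is None:
            held[cpu] = memory["counter"]        # read access
        else:
            memory["counter"] = held[cpu] + 1    # write access
            held[cpu] = None
    return memory["counter"]


back_to_back = run([0, 0, 1, 1])   # processor 0 finishes before 1 starts
interleaved = run([0, 1, 0, 1])    # both read before either writes
```

Both schedules execute the same instructions on both processors, yet the final memory contents differ, so two CEs that arbitrate differently would diverge.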
[0082] A first approach to controlling bus arbitration 
adjusts the inter-processor relationship. This approach 
is very complex because every internal caching mech- 
anism in the processors affects the relationship between 
the processors at the bus arbitration boundary. 
[0083] A simpler approach for controlling bus arbitration
is to reset the processors after every execution
interruption. This avoids algorithm complexities at the
expense of effectively disabling all caching mechanisms,
since the caches are flushed on every reset.
[0084] Bus arbitration is a general problem at every bus in
the system. The CPU to cache, CPU to memory, CPU to PCI,
and PCI to ISA buses share this problem. All these buses
are synchronous to the CPU clock, but each is controlled by
a divide ratio. Therefore, each must be aligned to remove
timing variations. Again, the alignment can be inferred or
made controllable with a chip set modification.

[0085] The CPU caches use a pseudo-random allocation policy
that is controlled by allocation requests. The allocation
policy must be aligned if cache divergence is allowed to
occur.

[0086] DMA activity affects the arbitration of system
buses. The DMA engine needs to be started and stopped based
on the alignment of the system buses to avoid uncertainty.
The MIC uses DMA to transfer data from the CE's memory to
the PCI bus and ultimately to the IOP. Another, less
efficient approach to accounting for DMA involves halting
the processors while DMA is underway.

[0087] In a CE configured for emulated clock lockstep
operation, there are four kinds of stimuli to the
motherboard: clocks, interrupts, data input, and other
asynchronous events. Each kind of stimulus is discussed
below.

a. Controlling Clocks 

[0088] A typical motherboard includes several clocks. The
core clock drives internal processor circuits. The
processor bus clock, also called the front side bus or FSB
clock, controls operation of the processor bus. The memory
refresh clock, which is often derived from the FSB clock,
controls memory refresh. The PCI bus clock controls
operation of the PCI bus. Finally, the interrupt controller
clock, also called the APIC clock, controls the timing of
interrupts. Alignment of all of these clocks can be
guaranteed by deriving all of them from a single
oscillator.

b. Controlling Interrupts

[0089] In the CEs, both the interrupts themselves and the
interrupt delivery mechanism must be controlled. The system
300 only needs three interrupts: the processor counter
interrupt, the inter-processor interrupt, and the MIC
interrupt. The processor counter interrupt initiates the
transition from clock lockstep (normal mode) operation to
instruction lockstep (system mode) operation. The
inter-processor interrupt coordinates the transition from
clock lockstep (normal mode) operation to instruction
lockstep (system mode) operation. The MIC interrupt
controls DMA transfers and transitions from instruction
lockstep (system mode) operation to clock lockstep (normal
mode) operation.
[0090] The synchronous delivery of all interrupts requires
that the APIC clock be synchronous as well as
synchronizable to the other clocks. While the synchronous
requirement can be met by deriving all clocks from the same
oscillator, the ability to synchronize the APIC clock using
software control is dependent on the specifics of the
motherboard components. The APIC clock also must have
appropriate skew to avoid divergence.

c. Controlling Data Input 

[0091] Input data delivery is made synchronous using 
custom circuitry in the MIC. Data transitions between the 
IOP MIC and the CE MIC are based on the clock of the
transmitter. The CE MIC accumulates data from the
IOPs while the CE is in clock lockstep (normal mode)
operation and makes the data available to the CE when
the CE is in instruction lockstep (system mode)
operation. The CE MIC is designed so that DMA data transfers
executed while the CE is in clock lockstep (normal
mode) operation will be synchronous to the instruction 
stream. 

d. Uncontrolled Events 

[0092] Some events are inherently uncontrolled. These
events are related to error conditions and alarms, and will
cause divergence of the CEs if they are allowed to occur.
Examples of these events include system management ("SMI")
interrupts, such as those used, for example, for power
management, and nonmaskable ("NMI") interrupts, such as
those associated with double bit memory errors. The system
control software disables the CE motherboard SMI interrupt.
Events that are reported through the SMI interrupt are
monitored by the CEs using the IOPs as a filter. The CEs
periodically read the SMI pending register and transfer the
data to the IOPs. The IOPs then determine if an SMI
interrupt is required and direct all CEs to execute an SMI
algorithm based on the pending register value returned by
the IOPs. The SMI activity is data divergent but not
instruction divergent. The data divergence is handled while
in instruction lockstep (system mode) operation.
[0093] One of the expected SMI events is related to ECC
errors. The algorithms used in the memory controllers to
handle correctable errors impact the frequency of CE
divergence. For example, on the fly correction with no
cycle penalty allows the CEs to continue without
divergence. With this approach, the SMI code performs the
write back correction at a future time. On the fly
correction with a cycle penalty causes CE divergence.
Automatic write back correction causes CE divergence. The
more specific the error address is, the quicker the single
bit error can be corrected.
[0094] NMI events tend to be fatal to the compute
environment. The CE generates an error packet to the IOPs
before letting the NMI interrupt occur. Thus, the IOPs are
notified which CE is in error prior to detecting the
resulting CE divergence. The IOPs respond to the CE error
by disabling the CE.



4. CE Realignment 

[0095] The system realigns the CEs to account for clock
drift between the CEs 305. Clock drift results because each
CE uses its own oscillator, with a common oscillator being
used for all processors of a CE. If left uncorrected, clock
drift could cause the CEs to drift so far apart that they
appear to no longer function correctly. The system 300
accounts for clock drift in a way that does not cause the
processors to diverge.

[0096] In one approach, as illustrated in Fig. 7, the system
uses a procedure 700 to have each CE periodically poll
(step 705) an I/O location a different number of times. A
minimum number of executions of the polling loop (step 710)
ensures that all the caches of the processors remain
consistent. For example, while one pass through the loop
will establish the content of some caches (e.g., the L1 and
L2 caches), multiple (e.g., four) passes may be required to
force consistency in other caches or buffers (e.g., the
Branch Target Buffer). After the required number of passes,
each CE then executes the loop a number of times needed to
compensate for clock drift (step 715). As this is a coarse
grained correction, the system must be able to handle a
clock skew the size of the grain (i.e., the size of the
time taken to complete a loop).
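
The coarse-grained nature of this correction can be made concrete with a small Python sketch. The function name, the way a CE's lead is expressed in clock cycles, and the default of four minimum passes (taken from the example in the text) are assumptions for illustration; the text does not specify how the lead is measured.

```python
MIN_PASSES = 4  # e.g., four passes to settle the caches and the BTB


def realignment_passes(lead_cycles, loop_cycles, min_passes=MIN_PASSES):
    """Return (passes to execute, residual skew in cycles) for one CE.

    lead_cycles: how far this CE is ahead of the slowest CE (assumed known).
    loop_cycles: time taken to complete one pass of the polling loop.
    """
    extra = lead_cycles // loop_cycles           # whole loops that fit in the lead
    residual = lead_cycles - extra * loop_cycles
    return min_passes + extra, residual


slowest = realignment_passes(lead_cycles=0, loop_cycles=50)
ahead = realignment_passes(lead_cycles=130, loop_cycles=50)
```

The CE that is ahead executes two extra passes and is left with a residual skew smaller than one loop, which is the grain the rest of the system must tolerate.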

[0097] Another polling technique has each CE poll the I/O
location the same number of times. The results of the polls
by a CE then are ORed together to produce a final result
for the CE.
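
A minimal Python sketch of this fixed-count polling follows. The `poll_fixed` and `make_reader` helpers are hypothetical; `make_reader` merely simulates an I/O location whose ready bit becomes set on a different poll for each CE.

```python
from functools import reduce
from operator import or_


def poll_fixed(read_io, polls):
    """Read the I/O location exactly `polls` times and OR the results."""
    return reduce(or_, (read_io() for _ in range(polls)), 0)


def make_reader(values):
    """Simulated I/O location that returns the given values in order."""
    it = iter(values)
    return lambda: next(it)


# The ready bit appears on the second poll for one CE, the fourth for the other.
result_a = poll_fixed(make_reader([0, 1, 1, 1]), polls=4)
result_b = poll_fixed(make_reader([0, 0, 0, 1]), polls=4)
```

Both CEs execute the same number of reads and obtain the same final result even though the bit became visible at different times.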

[0098] In another approach to coordinating the processors,
the CEs enter a HALT state while waiting for I/O. On the
completion of I/O, the CEs are interrupted. Either the
interrupt is tied to the alignment of the system, or an
alignment process is executed in the interrupt handler. In
either case, the alignment between CEs is guaranteed on
exit from the interrupt handler.

SMP CE Motherboard Requirements 

[0099] A motherboard must have a number of features to
support the synchronization of SMP CEs in the system 300.
As noted above, the CEs operate in two modes: emulated
clock lockstep and instruction lockstep. During emulated
clock lockstep operation, the CEs execute the operating
system and applications software. During instruction
lockstep operation, the CEs execute system control
software.

[0100] In emulated clock lockstep operation, the initial
state and all inputs are guaranteed to be identical between
motherboards. Both motherboards execute the same
instruction stream in the same number of clock cycles.
Execution proceeds uncontrolled for a pre-selected number
of clock cycles.
[0101] Instruction lockstep mode is then entered. In
instruction lockstep, both motherboards execute the same
instruction stream, but the number of clock cycles required
is not consistent. During instruction lockstep, each CE
communicates with its interface card 335 to negotiate
synchronous input delivery. Compensation for clock drift
between CE motherboards is also handled during instruction
lockstep mode.
[0102] Returning to emulated clock lockstep mode requires
realigning the clock structure of the motherboard
with the instruction stream. Both motherboards will 
again present the exact same state during emulated 
clock lockstep mode. 

[0103] The system 300 relies on being able to emulate 
the effect of clock lockstep motherboards without actually
building a phase locked clock structure between the
motherboards. Emulating phase locked clocks elimi- 
nates the need to detect and control the interactions be- 
tween the symmetric processors on a motherboard. A 
phase locked clock structure also removes all con- 
straints on the coding style that the programmer uses 
when producing an SMP compliant application. This 
renders the system 300 operating system and application
independent.

[0104] The features needed to provide the emulated 
phase lock structure include a single clock structure, a 
synchronous memory system, synchronizable memory 
refresh, a synchronous PCI bus, synchronous interrupt
delivery, and the ability to perform a processor reset.
Each of these requirements is described in more detail 
below. 

1. Single Clock Structure

[0105] The major clocking structure for the mother- 
board must be derived from a single oscillator. The re- 
quired clocks are the FSB clock, the PCI clock, and the 
APIC clock.

[0106] All of the clocks must be derived from a single
oscillator. The FSB clock is the highest frequency, and
is divided down to produce the PCI clock. The PCI clock
is divided down to produce the APIC clock. The phase
relationship between the FSB clock, the PCI clock, and
the APIC clock must be guaranteed.
[0107] Other devices and their respective clocks can
be present on the motherboard, provided that they are 
disabled and that they do not impact the synchronous 
operation of the motherboard when they are disabled. 
Examples of potentially asynchronous onboard clocks 
include video, ethernet, SCSI, CMOS, and USB. 

2. Synchronous Memory System 

[0108] The two requirements for the memory system 
are that it must run synchronously to the FSB, and that
state information in the memory system must be either
self-clearing or controllable by software. Current North 
Bridge chipsets meet these requirements. 
[0109] The first requirement is that memory must act
like a state engine with predictable timing synchronous
to the processor bus. An asynchronous memory system
cannot be used.



[0110] The second requirement is aimed at future chipsets.
The memory interface is currently dealt with as an
invisible block of logic. The timing is not dependent on a
long history of past activity. Write buffers and cache
structures are assumed not to require software intervention
to maintain lockstep operation. Memory arbitration
algorithms are assumed to be self-clearing based on idle
activity at the memory interface. When a new structure that
challenges these assumptions is added, a software technique
for aligning the structure is required.

3. Synchronizable Memory Refresh 

[0111] Normal memory refresh operations, such as the
CAS-before-RAS (CBR) Refresh, are generated from a
synchronous clock structure, but appear asynchronous to the
instruction stream. When transitioning from instruction
lockstep mode to emulated clock lockstep mode, the refresh
operation must be realigned to the instruction stream under
software control.

[0112] Many chipsets allow refresh to be disabled. This does
not necessarily meet the requirement. The delay from when
refresh is reenabled until the first refresh request occurs
must be the same every time that refresh is reenabled. The
simplest scheme for achieving this is to allow software to
reset the counter that creates the refresh request.
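
The difference between merely disabling refresh and resetting the refresh counter can be shown with a toy Python model. The `RefreshTimer` class is hypothetical, and it assumes that the counter holds its value while refresh is disabled; actual chipsets may behave differently.

```python
class RefreshTimer:
    """Toy refresh counter that fires every `period` clock ticks."""

    def __init__(self, period):
        self.period = period
        self.count = 0
        self.enabled = True

    def tick(self):
        """Advance one clock; return True when a refresh request fires."""
        if not self.enabled:
            return False
        self.count += 1
        if self.count == self.period:
            self.count = 0
            return True
        return False

    def disable(self):
        self.enabled = False      # assumed: the count is held, not cleared

    def reenable(self, reset_counter):
        self.enabled = True
        if reset_counter:
            self.count = 0        # software-visible counter reset


def delay_after_reenable(pre_ticks, reset_counter, period=10):
    """Ticks from re-enabling refresh until the first refresh request."""
    timer = RefreshTimer(period)
    for _ in range(pre_ticks):    # history before refresh was disabled
        timer.tick()
    timer.disable()
    timer.reenable(reset_counter)
    ticks = 0
    while True:
        ticks += 1
        if timer.tick():
            return ticks
```

Without the reset, the delay depends on how far the counter had advanced before it was disabled (for example, `delay_after_reenable(3, False)` and `delay_after_reenable(6, False)` differ). With the reset, the delay always equals the full period.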

[0113] The refresh rate has an impact on system
functionality. Input is queued in the interface card 335 of
the CE 305 until the CE is in instruction lockstep mode.
When the CE returns to emulated clock lockstep mode, input
data is transferred into the memory 325 of the CE. The more
frequently the CE cycles through instruction lockstep mode,
the lower the latency will be for I/O operations. Each
transition from instruction lockstep mode to emulated clock
lockstep mode requires refresh to be realigned, and the
minimum time the CE can spend in emulated clock lockstep
mode is constrained by the refresh rate. Therefore the I/O
latency is constrained by the refresh rate.

4. Synchronous PCI Bus 

[0114] The two requirements for the PCI bus are that it must
run synchronously to the front side bus, both in frequency
and phase, and that state information in the PCI bridge
(i.e., the North Bridge) must be either self-clearing or
controllable by software. Current North Bridge chipsets
meet these requirements.
[0115] The only active device on the PCI bus is the
interface card 335. The CPU performs I/O reads and writes
to the interface card 335 while in emulated clock lockstep
mode. The interface card 335 also performs DMA in and out
of system memory while in emulated clock lockstep mode. The
bus operations must be fully synchronous.

[0116] The PCI bridge is expected to have a limited set of
write buffers and a short term arbitration scheme. The
system software relies on an idle PCI bus being a
sufficient condition to clear any past historical state
information from the PCI bridge. If a new structure is
added that challenges these assumptions, a software
technique for aligning the structure is required.

5. Synchronous Interrupt Delivery

[0117] The system 300 uses three interrupts: the performance
counter interrupt through the local APIC, the
inter-processor interrupt using the local APIC, and a PCI
interrupt from the interface card 335. All three interrupt
sources must be synchronous with respect to the internal
structure of the processor.

[0118] The APIC clock must provide interrupt delivery
to the processor so that an interrupt can be used to align 
halted processors. In addition, the APIC clock must be 
skewed with respect to the FSB clock at the processor 
pins such that no uncertainty exists in the reception of 
an interrupt request. 

6. Processor Reset 

[0119] Current Intel processors include a number of 
internal structures that can only be cleared with a full
processor reset. When two motherboards are first syn- 
chronized with each other, the context of the active CE 
processors related to the instruction stream is stored in 
main memory along with a restore procedure. A soft- 
ware-driven processor reset is issued to clear all proc- 
essor internal structures. Thus, both motherboards are 
put in the same initial state before starting their emulated 
clock lockstep operation. The reset causes the processor
to enter BIOS at the restart vector. Control is then
transferred from the BIOS to the restore procedure, 
which runs in the instruction lockstep mode. The restore 
procedure then initiates emulated clock lockstep mode. 
[0120] The motherboard must provide a feature for resetting
the processors without destroying the context of the
current instruction stream. The BIOS must provide a method
of redirecting processor execution to a memory-resident
recovery routine. Additionally, the processors must not
have accessed any data or instruction divergent areas of
the motherboard between the reset and the execution of the
first instruction of the restore procedure.

[0121] An example of a suitable system is offered by 
the Intel 82443BX and PIIX4 chips. On the resume from
a power-on suspend, the 82443BX generates a processor
reset. The BIOS can be directed through CMOS location
0Fh to bypass POST. BIOS can also be directed
to vector through memory location 40:67 to the restore 
code. This combination provides a clean method of 
clearing the processor history structures. 

Simplified System 

[0122] Fig. 8 illustrates a simplified system 800 useful for
verifying and explaining concepts employed by the SMP
system. The system 800 is a Y system with two CEs 805 and
one IOP 810. Each CE 805 is a dual-processor SMP system
having 32 MB of memory and a floppy drive. All other
peripherals have been removed. A simple ISA module is
attached to provide a controlled external interrupt. The
IOP 810 is a uni-processor.
[0123] The IOP 810 is responsible for comparing the output
results from the CEs. There is no re-directed I/O. The CEs
run, and the IOP monitors their synchronization.

[0124] Referring to Figs. 9A and 9B, the system 800 operates
according to a procedure 900. First, the primary processor
of each CE boots from its own floppy drive (step 905). The
primary processor then loads and executes a custom program
that includes both the SMP control software and the SMP
application (step 910). This program causes each CE to
report its status to the IOP and wait for a response (step
915).

[0125] The IOP responds by instructing the CEs to operate
for 100,000 clock cycles and to then report back to the IOP
(step 920). Upon receiving the response, each CE performs a
self reset to purge the processors of divergent data
structures (step 925).

[0126] The primary processor of each CE then executes an
APIC wakeup sequence for the auxiliary processor (step
930). The primary processor does so by sending an interrupt
to the auxiliary processor and halting (step 932). The
auxiliary processor responds by sending an interrupt to the
primary processor and halting (step 934).

[0127] Upon receiving the interrupt from the auxiliary
processor, the primary processor in each CE stops memory
refresh (step 936). The processor then issues an interrupt
to all processors and halts (step 938).
[0128] Upon receiving the interrupt that it issued, the 
primary processor restarts refresh (step 940) and sets 
its perfomiance counter for an Interrupt after 100,000 
clock cycles (step 945). Each processor then resets its 

40 time stamp counter ('TSC") (step 947). The time stamp 
counter counts the number of clock cycles occurring at 
the processor. The value of the counter is used by the 
lOP to monitor synchronization of the CEs. 
[0129] Each CE then begins execution of a set of four
tests that make unconstrained (no locks or semaphores)
modifications to a 64,000 byte section of memory 325
(step 950). Both the primary processor and the auxiliary
processor of a CE run the tests. The index variable that
controls which test is run next is also unconstrained and
accessed independently by the two processors of the
CE.

[0130] At the completion of 100,000 clock cycles by
the primary processor, the primary processor receives
an interrupt (step 955). The primary processor responds
by sending a snapshot of its registers and TSC to memory
and sending an interrupt to stop the auxiliary processor
using the APIC bus (step 957). The primary processor
halts (step 959) after sending the interrupt.
[0131] The auxiliary processor responds to the interrupt
by sending a snapshot of its registers and TSC to
memory (step 960), sending an interrupt to the primary
processor (step 962), and halting (step 964).
[0132] The primary processor responds to the interrupt
by sending its own snapshot and the auxiliary processor's
snapshot to the IOP for comparison (step 965)
and halting (step 967). The send is a rate-based serial
transmit, in which the processor sends a character and
waits X instruction loops. This avoids polling the serial
port. The send also could be done using a send/halt
interrupt scheme.
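
The rate-based transmit described above can be sketched as follows. This is only an illustration in Python under stated assumptions: `write_char` stands in for whatever routine writes one byte to the serial port, and `delay_loops` plays the role of the "X instruction loops"; neither name comes from the specification.

```python
from typing import Callable, List


def rate_based_send(data: bytes,
                    write_char: Callable[[int], None],
                    delay_loops: int) -> int:
    """Send each byte, then spin a fixed number of loop iterations.

    The port status is never read, so every CE executes the same
    instruction sequence regardless of how the port behaves.
    Returns the total number of delay iterations executed.
    """
    spins = 0
    for byte in data:
        write_char(byte)              # hypothetical port write
        for _ in range(delay_loops):  # fixed pacing delay, no polling
            spins += 1
    return spins


# Usage: collect the "transmitted" bytes in a list instead of a real port.
sent: List[int] = []
total_spins = rate_based_send(b"SNAP", sent.append, delay_loops=50)
```

The fixed delay trades throughput for determinism: the sender's instruction stream does not depend on the state of the port.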

[0133] The IOP compares the packets from the CEs
(step 970). The IOP then sends a single character interrupt
to the CEs (step 972). If the packets from the CEs
are the same, the interrupt tells the CEs to continue.
Otherwise, the interrupt tells the CEs to start over.
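
A minimal sketch of the comparison in steps 970 and 972 is given below. It assumes each CE's packet arrives as a byte string holding the register and TSC snapshots; the command bytes `CONTINUE` and `RESTART` are hypothetical placeholders, since the specification says only that a single character is sent.

```python
from typing import Sequence

CONTINUE = b"C"  # hypothetical command character: packets matched
RESTART = b"R"   # hypothetical command character: packets differed


def iop_compare(packets: Sequence[bytes]) -> bytes:
    """Return the single-character command the IOP sends to all CEs."""
    if not packets:
        raise ValueError("at least one packet is required")
    first = packets[0]
    # Every CE must report a byte-identical snapshot to continue.
    if all(packet == first for packet in packets[1:]):
        return CONTINUE
    return RESTART


# Usage with two CEs reporting identical, then differing, snapshots.
same = iop_compare([b"\x01\x02TSC=100000", b"\x01\x02TSC=100000"])
diff = iop_compare([b"\x01\x02TSC=100000", b"\x01\x03TSC=100000"])
```
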
[0134] If the interrupt tells the CE to start over (step
975), the primary processor begins again with step 925.
If the interrupt tells the CE to continue (step 975), the
primary processor stops memory refresh (step 977).
The primary processor then issues an interrupt to itself
and the auxiliary processor and halts (step 979). The
primary processor generates the interrupt because, in
the process of communicating with the IOP, the CEs
have diverged in time. The interrupt serves to reactivate
the auxiliary processor and to realign the instructions of
the processors with the clocking structure of the CEs.
The interrupt must be based on the lowest frequency
clock of the clocking structure. This may be a clock
generated in the interface card of the CE. In some systems,
it may be the APIC clock.

[0135] Upon receiving the interrupt that it issued, the 
primary processor restarts refresh (step 980) and sets 
its performance counter for an interrupt after 100,000 
clock cycles (step 985). Each processor then resets its 
time stamp counter ("TSC") (step 990), and the processors
proceed with step 950.

[0136] The system 800 operates in three modes of
operation: divergent, timing divergent (instruction lockstep),
and emulated clock lockstep. In the divergent
mode, there is no correlation between what the different
CEs execute. In the timing divergent mode, both CEs
execute the same instruction stream but with a different
number of clock cycles. Finally, in the emulated clock
lockstep mode, both CEs execute exactly the same
instructions at exactly the same clock cycles. In the
procedure 900, steps 905-925 are divergent, steps 930-938
and 970-979 are timing divergent, and steps 940-967
and 980-990 are emulated clock lockstep.
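
The assignment of steps to modes can be written as a small lookup table. The following Python sketch simply encodes the step ranges stated above; the names `Mode` and `mode_for_step` are illustrative and do not appear in the specification.

```python
from enum import Enum


class Mode(Enum):
    DIVERGENT = "divergent"
    TIMING_DIVERGENT = "timing divergent (instruction lockstep)"
    CLOCK_LOCKSTEP = "emulated clock lockstep"


# (first step, last step, mode), taken from the ranges in procedure 900.
STEP_RANGES = [
    (905, 925, Mode.DIVERGENT),
    (930, 938, Mode.TIMING_DIVERGENT),
    (940, 967, Mode.CLOCK_LOCKSTEP),
    (970, 979, Mode.TIMING_DIVERGENT),
    (980, 990, Mode.CLOCK_LOCKSTEP),
]


def mode_for_step(step: int) -> Mode:
    """Return the operating mode in which a step of procedure 900 runs."""
    for first, last, mode in STEP_RANGES:
        if first <= step <= last:
            return mode
    raise ValueError(f"step {step} is not part of procedure 900")
```

For example, `mode_for_step(950)` gives the emulated clock lockstep mode, because the test workload of step 950 runs between the realigning interrupt and the quantum interrupt.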
[0137] The system 800 takes two asynchronous SMP
CEs 805 and makes them behave as if they were clocked
synchronously. This requires controlling all sources of
asynchronous behavior and also compensating for the
frequency difference between the CEs. The intent of the
system 800 is to verify that this concept works, without
requiring implementation of the entire hardware and
software structure necessary for a product.



[0138] Major sources of asynchronous behavior originate
from memory refresh, bus arbitration, cache line
fill algorithms, branch prediction, interrupt delivery, DMA
activity, I/O polling, and video refresh. These can be
controlled by proper initialization and by not allowing
divergent code execution. The sources of asynchronous
and/or divergent behavior are addressed below.

1. Context coordination

[0139] In an emulated clock lockstep SMP implementation,
divergent code results in excessive overhead
necessary to resynchronize the CEs. The system 800
overcomes the handicap of using divergence-oriented
interface hardware.

2. Memory Refresh 

[0140] Memory refresh is an automatic activity controlled
by the CPU/PCI bridge chip set. Memory refresh
is known to cause divergence between instances of an
SMP system because it modifies the access time for
memory. This eventually results in reordering of the
interlocks between processors. Refresh can be easily
realigned with a minor modification in the bridge chip set.
The system 800 removes refresh interaction by turning
off refresh. As long as all pages of memory are accessed
frequently enough, memory content will not be lost. The
refresh interval for DRAM chips is typically specified as
8 milliseconds. However, longer refresh intervals have
generally been successful.

[0141] In the system 800, memory can be refreshed
in one of two ways. Either a hurry-up refresh rate can
be programmed, or a CPU-directed memory walk can
be performed. In such a memory walk, the application
is responsible for sweeping through the pages of memory
to keep the DRAM cells alive. An alternative is to
sense the alignment of refresh and to hold off processor
activity until the proper alignment is reached. This is not
an option for an actual product since this would entail
throwing away 16 microseconds on each quantum
interrupt.
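
The access pattern of a CPU-directed memory walk can be sketched as below. This Python fragment models only the sweep itself: a `bytearray` stands in for the memory region, and `row_size` is an assumed parameter for the span of addresses covered by one access. Whether such reads actually keep DRAM cells alive depends on the memory hardware, which this sketch does not model.

```python
def memory_walk(memory: bytearray, row_size: int) -> int:
    """Read one byte from every row-sized block of the region.

    Returns the number of blocks touched in one sweep.
    """
    if row_size <= 0:
        raise ValueError("row_size must be positive")
    touched = 0
    checksum = 0
    for offset in range(0, len(memory), row_size):
        checksum ^= memory[offset]  # the read is the access that matters
        touched += 1
    return touched


# Usage: sweep a 64,000 byte region in assumed 1,024 byte rows.
region = bytearray(64_000)
rows_touched = memory_walk(region, row_size=1024)
```

Because every CE runs the same sweep at the same point in its instruction stream, the walk itself does not introduce divergence.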

3. Bus arbitration 

[0142] The I/O buses in a machine are synchronous
to the CPU clock. The bus clocking is derived from the
CPU clock through a divider. To keep the SMP motherboards
operating together, the processors always start
off on the same divide count.

4. Cache line fill algorithms 

[0143] The goal is to not affect the cache line fill. The
initial boot and load process will most likely disturb caching.
Even if the same data is in cache, it may be present
in different lines of the cache. The cache can be flushed
using a processor reset operation. As an alternative, a
flush algorithm that provides guaranteed cache results
at its completion can be used. When used, this algorithm
is performed after the application has loaded.

5. Branch prediction 

[0144] The branch prediction logic is another form of 
caching. It is sensitive to slightly different factors than a 
normal cache. Branch prediction has a content based 
on the recent history of branches. An instruction cache 
is modified on the first pass through a loop. Branch pre- 
diction is modified based on how many times a particular 
branch is taken. The branch prediction logic can be 
cleared using the processor reset operation. As an alternative,
an algorithm that uses polling without leaving the
branch table divergent can be used.

6. Clock drift adjustment 

[0145] The CEs operate as if in clock lock step. Each
CPU takes exactly the same number of clock cycles to
do the same job. Since the clocks are not frequency
locked, the CEs diverge in time, but not in function. A
gross exaggeration would have one CE running at 62
MHz and the other at 68 MHz for a nominal 60 MHz system.
At the end of a second, one CE is four million cycles
behind the other. This can be remedied by wasting time
in the faster CE without causing divergence. One technique
is to execute a do-nothing loop in both processors,
with one processor executing it just enough to reorder
the branch prediction and caches, and the other processor
executing it until the designated clock cycles have
been wasted.
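
The amount of time to waste in the faster CE can be estimated per quantum. The sketch below is an illustration rather than the method of the system 800: it assumes both CEs execute the same number of cycles N per quantum, so the slower CE needs N/f_slow seconds and the faster CE N/f_fast seconds. Expressed in cycles of the faster clock, the difference is N(f_fast - f_slow)/f_slow. The function name and the example frequencies are hypothetical.

```python
def idle_cycles_for_faster_ce(quantum_cycles: int,
                              f_slow_hz: float,
                              f_fast_hz: float) -> float:
    """Cycles the faster CE must waste so both quanta end together.

    Derived from (N / f_slow - N / f_fast) * f_fast.
    """
    if f_slow_hz <= 0 or f_fast_hz < f_slow_hz:
        raise ValueError("expected 0 < f_slow_hz <= f_fast_hz")
    return quantum_cycles * (f_fast_hz - f_slow_hz) / f_slow_hz


# Usage: a 100,000 cycle quantum with clocks 100 ppm apart (assumed values).
padding = idle_cycles_for_faster_ce(100_000, 60.0e6, 60.006e6)
```

With clocks that differ by 100 parts per million, the faster CE has about ten cycles to waste at the end of each 100,000 cycle quantum.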

7. Interrupt delivery 

[0146] Interrupts are controlled by a different clock
than the CPU clock. This interrupt clock is made syn- 
chronous to the CPU clock. As with bus arbitration, the 
processors are aligned to the interrupt clock. 

8. DMA Activity 

[0147] Data is moved from main memory to MIC 
memory in a way that does not affect the relationship of 
the processors. DMA is started synchronously to some 
activity that is understood by both the CEs and their 
MICs. In the system 800, this means that DMA is only 
allowed when the processors are halted, which avoids 
interaction. 

9. I/O Polling 

[0148] Any attempt to access data outside the CE
may result in potentially divergent behavior. One solution
to this problem is to incorporate the algorithm for
branch prediction along with a custom MIC. The system
800 solves this problem by severely restricting I/O. The
CPU/MIC interface is handled as a half duplex link with
HALT and interrupt being used as the semaphores.
[0149] Other embodiments are within the scope of the
following claims.


Claims 

1. A fault tolerant/fault resilient computer system
(100), comprising: at least two compute elements
(105) having clocks that operate asynchronously to
clocks of the other compute elements; and

at least one controller connected to the at least
two compute elements (105),

the system being characterized in that:
the compute elements (105) operate in a first
mode in which the compute elements (105)
each execute a first stream of instructions in
emulated clock lockstep, wherein the compute
elements (105) perform the same sequence of
instructions in the same order, with each
instruction being performed in the same clock
cycle by each compute element,
the compute elements (105) operate in a second
mode in which the compute elements (105)
each execute a second stream of instructions
in instruction lockstep, wherein the compute
elements perform the same sequence of
instructions in the same order, but are not
required to perform the instructions in the same
clock cycle,
the compute elements (105) operate in the first
mode during normal processing of applications
and operating system software, and
the compute elements (105) operate in the second
mode during processing of lockstep control
software used in establishing or maintaining
aligned operation of the compute elements
(105).

2. The computer system of claim 1, wherein the at
least two compute elements each comprise a
multiprocessor compute element.

3. The computer system of claim 2, wherein the at
least two compute elements each comprise a symmetric
multi-processor (SMP) compute element.

4. The computer system of claim 2, wherein each
compute element is implemented using an industry
standard motherboard.

5. The computer system of claim 1, wherein the
operating system and application software comprise
unmodified software configured for use with computer
systems that are not fault tolerant.

6. The computer system of claim 2, wherein the system
is configured to deactivate all but one of the
processors of each compute element when the
compute elements are operating in the second
mode.

7. The computer system of claim 1, wherein each
compute element comprises a processor, memory, and
a connection to the controller.

8. The computer system of claim 7, where each compute
element is configured so that refresh operations
associated with the memory are synchronized
with execution of operations by the processor.

9. The computer system of claim 7, wherein the system
is configured to initiate DMA transfers to the
memory when the compute elements are operating
in the second mode and to execute the initiated
DMA transfers when the compute elements are
operating in the first mode.

10. The computer system of claim 7, wherein the system
is configured to synchronize compute elements
by:

copying contents of the memory of a first compute
element to the memory of a second compute
element; and

resetting the processors of the first and second
compute elements without affecting the memories
of the compute elements.

11. The computer system of claim 1, wherein each
compute element is configured to transition from the first
mode of operation to the second mode of operation
in response to an interrupt.

12. The computer system of claim 11, wherein the
interrupt comprises a performance counter interrupt
generated by the compute element after the occurrence
of a fixed number of clock cycles.

13. The computer system of claim 12, wherein the
interrupt comprises a performance counter interrupt
generated by the compute element after the occurrence
of a fixed number of processor clock cycles.

14. The computer system of claim 12, wherein the in- 
terrupt comprises a performance counter interrupt 
generated by the compute element after the occur- 
rence of a fixed number of bus clock cycles. 

15. The computer system of claim 11, wherein the
interrupt comprises an interrupt generated by the
compute element after the execution of a fixed
number of instructions.

16. The computer system of claim 11, wherein the at
least two compute elements comprise a multi-processor
compute element having a primary processor
and one or more secondary processors, and wherein
the primary processor is configured to halt operation
of the secondary processors in response to
the interrupt.

17. The computer system of claim 1, wherein each
compute element is configured to generate an interrupt
during transition from the second mode of operation
to the first mode of operation, the interrupt serving
to align the processing by the compute element with
a clocking structure of the compute element.

18. The computer system of claim 17, wherein the
interrupt is synchronized with a clock having the lowest
frequencies of the clocking structure.

19. The computer system of claim 1, wherein the system
is configured to redirect I/O operations by the
compute elements to the controller.

20. The computer system of claim 1, further comprising
a second controller connected to the first controller
and to the at least two compute elements.

21. The computer system of claim 20, wherein the first
controller and a first one of the compute elements
are located in a first location and the second controller
and a second one of the compute elements
are located in a second location, and further comprising
a communications link connecting the first
controller to the second controller, the first controller
to the second one of the compute elements, and the
second controller to the first one of the compute
elements.

22. The computer system of claim 21, wherein the first
location is spaced from the second location by more
than 5 meters.

23. The computer system of claim 22, wherein the first
location is spaced from the second location by more
than 100 meters.

24. A method of operating a fault tolerant/fault resilient
computer system (100) having at least two compute
elements (105) connected to at least one controller
and having clocks that operate asynchronously to
clocks of the other compute elements, the method
characterized by comprising:

operating the compute elements (105) in a first
mode in which the compute elements each execute
a first stream of instructions in emulated
clock lockstep, wherein the compute elements
(105) perform the same sequence of instructions
in the same order with each instruction
being performed in the same clock cycle by
each compute element, and
operating the compute elements (105) in a second
mode in which the compute elements (105)
each execute a second stream of instructions
in instruction lockstep, wherein the compute
elements (105) perform the same sequence of
instructions in the same order, but are not
required to perform the instructions in the same
clock cycle,

wherein the compute elements (105) operate
in the first mode during normal processing of
applications and operating system software and in the
second mode during processing of lockstep control
software used in establishing or maintaining aligned
operation of the compute elements (105).

25. The method of claim 24, wherein each compute el- 
ement comprises a multi-processor compute ele- 
ment. 

26. The method of claim 25, further comprising deacti- 
vating all but one of the processors of each compute 
element when operating the compute elements in 
the second mode. 

27. The method of claim 24, wherein each compute
element comprises a processor, memory, and a
connection to the controller.

28. The method of claim 27, further comprising syn- 
chronizing refresh operations associated with the 
memory with execution of operations by the proc- 
essor. 

29. The method of claim 27, further comprising initiating 
DMA transfers to the memory when operating the 
compute elements in the second mode and execut- 
ing the initiated DMA transfers when operating the 
compute elements in the first mode. 

30. The method of claim 27, further comprising
synchronizing the compute elements by:

copying contents of the memory of a first compute
element to the memory of a second compute
element; and

resetting the processors of the first and second
compute elements without affecting the memories
of the compute elements.

31. The method of claim 24, further comprising
transitioning from operating in the first mode to operating
in the second mode in response to an interrupt.

32. The method of claim 31, wherein the at least two
compute elements each include a multi-processor
compute element having a primary processor and
one or more secondary processors, the method further
comprising halting operation of the secondary
processors in response to the interrupt.

33. The method of claim 24, further comprising redirect- 
ing I/O operations by the compute elements to the 
controller. 


Patentansprüche

1. Fehlertolerantes bzw. fehlerrobustes Computersystem
(100) mit: mindestens zwei Rechenelemente
(105) mit Taktgebern, die gegenüber den Taktgebern
der anderen Rechenelemente asynchron arbeiten; und

mindestens einem Controller, der/die mit den
mindestens zwei Rechenelementen (105) verbunden
ist/sind,

dadurch gekennzeichnet, daß

die Rechenelemente (105) in einem ersten Modus
aktiv sind, in dem jedes Rechenelement
(105) in einem emulierten Takt-Lockstep einen
ersten Strom von Anweisungen ausführt, wobei
die Rechenelemente (105) die gleiche Anweisungssequenz
in der gleichen Reihenfolge
ausführen und jede Anweisung von jedem Rechenelement
im gleichen Taktzyklus ausgeführt wird,
die Rechenelemente (105) in einem zweiten
Modus aktiv sind, in dem jedes Rechenelement
(105) im Anweisungs-Lockstep einen zweiten
Strom von Anweisungen ausführt, wobei die
Rechenelemente die gleiche Anweisungssequenz
in der gleichen Reihenfolge ausführen,
die Anweisungen jedoch nicht im gleichen Taktzyklus
ausführen müssen,
das gewöhnliche Verarbeiten von Anwendersoftware
und Betriebssystem durch die Rechenelemente
(105) im ersten Modus geschieht, und
das Verarbeiten der Lockstep-Steuersoftware,
die die Rechenelemente (105) aufeinander
ausrichtet bzw. gewährleistet, daß die hergestellte
Ausrichtung der Rechenelemente (105)
aufeinander erhalten bleibt, durch die Rechenelemente
(105) im zweiten Modus geschieht.

2. Computersystem nach Anspruch 1, wobei jedes der
mindestens zwei Rechenelemente ein Multiprozessor-Rechenelement
beinhaltet.

3. Computersystem nach Anspruch 2, wobei jedes der
mindestens zwei Rechenelemente ein Rechenelement
zum symmetrischen Multiprocessing (SMP)
beinhaltet.

4. Computersystem nach Anspruch 2, wobei jedes
Rechenelement durch ein der Industrienorm
entsprechendes Motherboard realisiert wird.

5. Computersystem nach Anspruch 1, wobei Betriebssystem
und Anwendersoftware nicht modifizierte
Software beinhalten, die für die Anwendung mit
nicht fehlertoleranten Computersystemen konfiguriert
ist.

6. Computersystem nach Anspruch 2, das so konfiguriert
ist, daß wenn die Rechenelemente im zweiten
Modus aktiv sind, die Prozessoren der Rechenelemente
soweit deaktiviert werden, daß je Rechenelement
nur noch ein Prozessor aktiv ist.

7. Computersystem nach Anspruch 1, wobei jedes
Rechenelement einen Prozessor, einen Speicher
und eine Verbindung zum Controller aufweist.

8. Computersystem nach Anspruch 7, wobei jedes
Rechenelement so konfiguriert ist, daß auf den
Speicher bezogene Aktualisierungsvorgänge mit
der Ausführung von Operationen durch den Prozessor
synchronisiert werden.

9. Computersystem nach Anspruch 7, das so konfiguriert
ist, daß DMA-Transfers zum Speicher initiiert
werden, wenn die Rechenelemente im zweiten Modus
aktiv sind, und daß die initiierten DMA-Transfers
ausgeführt werden, wenn die Rechenelemente
im ersten Modus aktiv sind.

10. Computersystem nach Anspruch 7, das so konfiguriert
ist, daß Rechenelemente synchronisiert werden
durch:

Kopieren von Speicherinhalt eines ersten Rechenelements
in den Speicher eines zweiten
Rechenelements; und

Rücksetzen der Prozessoren des ersten und
des zweiten Rechenelements, ohne daß dadurch
die Speicher der Rechenelemente beeinflußt
werden.

11. Computersystem nach Anspruch 1, wobei jedes
Rechenelement so konfiguriert ist, daß es in Reaktion
auf einen Interrupt vom ersten Operationsmodus
in den zweiten Operationsmodus wechselt.

12. Computersystem nach Anspruch 11, wobei der Interrupt
einen Interrupt des Operationszählers mit
einschließt, der nach dem Auftreten einer festgelegten
Anzahl von Taktzyklen vom Rechenelement
ausgelöst wird.

13. Computersystem nach Anspruch 12, wobei der Interrupt
einen Interrupt des Operationszählers mit
einschließt, der nach dem Auftreten einer festgelegten
Anzahl von Taktzyklen des Prozessors vom
Rechenelement ausgelöst wird.

14. Computersystem nach Anspruch 12, wobei der Interrupt
einen Interrupt des Operationszählers mit
einschließt, der nach dem Auftreten einer festgelegten
Anzahl von Taktzyklen des Busses vom Rechenelement
ausgelöst wird.

15. Computersystem nach Anspruch 11, wobei der Interrupt
einen Interrupt mit einschließt, der nach
Ausführung einer festgelegten Anzahl von Anweisungen
vom Rechenelement ausgelöst wird.

16. Computersystem nach Anspruch 11, wobei die mindestens
zwei Rechenelemente ein Multiprozessor-Rechenelement
mit einem Primärprozessor und einem
oder mehreren Sekundärprozessoren beinhalten,
und wobei der Primärprozessor so konfiguriert
ist, daß in Reaktion auf den Interrupt die Aktivitäten
der Sekundärprozessoren gestoppt werden.

17. Computersystem nach Anspruch 1, wobei jedes
Rechenelement so konfiguriert ist, daß während
des Wechsels vom zweiten Operationsmodus in
den ersten Operationsmodus ein Interrupt ausgelöst
wird, wobei der Interrupt dazu dient, den Verarbeitungsprozess
des Rechenelements und eine
Taktgabestruktur des Rechenelements aufeinander
auszurichten.

18. Computersystem nach Anspruch 17, wobei der Interrupt
mit einem Taktgeber synchronisiert wird, der
in der Taktgabestruktur die niedrigsten Frequenzen
hat.

19. Computersystem nach Anspruch 1, das so konfiguriert
ist, daß Ein-/Ausgabeoperationen von den Rechenelementen
auf den Controller umgeleitet werden.

20. Computersystem nach Anspruch 1, mit einem zweiten
Controller, der mit dem ersten Controller und
den mindestens zwei Rechenelementen verbunden
ist.

21. Computersystem nach Anspruch 20, wobei der erste
Controller und ein erstes der Rechenelemente
sich an einem ersten Ort befinden, und der zweite
Controller und ein zweites der Rechenelemente
sich an einem zweiten Ort befinden, und mit einer
Übertragungsverknüpfung, die den ersten Controller
mit dem zweiten Controller, den ersten Controller
mit dem zweiten der Rechenelemente, und den
zweiten Controller mit dem ersten der Rechenelemente
verbindet.

22. Computersystem nach Anspruch 21, wobei der erste
Ort vom zweiten Ort mehr als 5 Meter entfernt
ist.

23. Computersystem nach Anspruch 21, wobei der erste
Ort vom zweiten Ort mehr als 100 Meter entfernt
ist.

24. Verfahren zum Betreiben eines fehlertoleranten
bzw. fehlerrobusten Computersystems (100) mit
mindestens zwei Rechenelementen (105), die mit
mindestens einem Controller verbunden sind, und
mit Taktgebern, die gegenüber den Taktgebern der
anderen Rechenelemente asynchron arbeiten,
gekennzeichnet durch folgende Schritte:

Betreiben der Rechenelemente (105) in einem
ersten Modus, in dem jedes Rechenelement in
einem emulierten Takt-Lockstep einen ersten
Strom von Anweisungen ausführt, wobei die
Rechenelemente (105) die gleiche Anweisungssequenz
in der gleichen Reihenfolge
ausführen und jede Anweisung von jedem Rechenelement
im gleichen Taktzyklus ausgeführt wird, und

Betreiben der Rechenelemente (105) in einem
zweiten Modus, in dem jedes Rechenelement
(105) im Anweisungs-Lockstep einen zweiten
Strom von Anweisungen ausführt, wobei die
Rechenelemente (105) die gleiche Anweisungssequenz
in der gleichen Reihenfolge
ausführen, die Anweisungen jedoch nicht im
gleichen Taktzyklus ausführen müssen,

wobei das gewöhnliche Verarbeiten von Anwendersoftware
und Betriebssystem durch die Rechenelemente
(105) im ersten Modus geschieht,
und das Verarbeiten der Lockstep-Steuersoftware,
die die Rechenelemente (105) aufeinander ausrichtet
bzw. gewährleistet, daß die hergestellte Ausrichtung
der Rechenelemente (105) aufeinander erhalten
bleibt, durch die Rechenelemente (105) im
zweiten Modus geschieht.

25. Verfahren nach Anspruch 24, wobei jedes Rechenelement
ein Multiprozessor-Rechenelement
beinhaltet.

26. Verfahren nach Anspruch 25, wobei, wenn die Rechenelemente
im zweiten Modus aktiv sind, die
Prozessoren der Rechenelemente soweit deaktiviert
werden, daß je Rechenelement nur noch ein
Prozessor aktiv ist.

27. Verfahren nach Anspruch 24, wobei jedes Rechenelement
einen Prozessor, einen Speicher und eine
Verbindung zum Controller aufweist.

28. Verfahren nach Anspruch 27, wobei auf den Speicher
bezogene Aktualisierungsvorgänge mit der
Ausführung von Operationen durch den Prozessor
synchronisiert werden.

29. Verfahren nach Anspruch 27, wobei DMA-Transfers
zum Speicher initiiert werden, wenn die Rechenelemente
im zweiten Modus aktiv sind und die
initiierten DMA-Transfers ausgeführt werden, wenn
die Rechenelemente im ersten Modus aktiv sind.

30. Verfahren nach Anspruch 27, wobei die Rechenelemente
synchronisiert werden durch:

Kopieren von Speicherinhalt eines ersten Rechenelements
in den Speicher eines zweiten
Rechenelements; und

Rücksetzen der Prozessoren des ersten und
des zweiten Rechenelements, ohne daß dadurch
die Speicher der Rechenelemente beeinflußt
werden.

31. Verfahren nach Anspruch 24, wobei in Reaktion auf
einen Interrupt vom ersten Operationsmodus in den
zweiten Operationsmodus gewechselt wird.

32. Verfahren nach Anspruch 31, wobei jedes der mindestens
zwei Rechenelemente ein Multiprozessor-Rechenelement
mit einem Primärprozessor und einem
oder mehreren Sekundärprozessoren beinhaltet,
und wobei in Reaktion auf den Interrupt die Aktivitäten
der Sekundärprozessoren gestoppt werden.

33. Verfahren nach Anspruch 24, wobei Ein-/Ausgabeoperationen
von den Rechenelementen auf den
Controller umgeleitet werden.



Revendications 

1. Système informatique tolérant les anomalies/résistant
aux anomalies (100) comprenant au moins
deux éléments de calcul (105) ayant des horloges
qui fonctionnent de façon asynchrone par rapport
aux horloges des autres éléments de calcul; et au
moins une unité de commande connectée à au
moins deux éléments de calcul (105); le système
étant caractérisé en ce que les éléments de calcul
(105) fonctionnent selon un premier mode au cours
duquel les éléments de calcul (105) exécutent chacun
une première suite d'instructions en synchronisme
d'horloge émulé dans lequel les éléments de
calcul (105) effectuent la même séquence d'instructions
dans le même ordre, chaque instruction étant
effectuée au cours du même cycle d'horloge par
chaque élément de calcul, les éléments de calcul
(105) fonctionnent selon un deuxième mode au
cours duquel les éléments de calcul (105) exécutent
chacun une deuxième suite d'instructions en
synchronisme d'instruction, dans lequel les éléments
de calcul exécutent la même séquence d'instructions
dans le même ordre, mais ne sont pas obligés
d'exécuter les instructions dans le même cycle
d'horloge, les éléments de calcul (105) fonctionnent
selon le premier mode pendant un traitement normal
d'applications et de logiciel de système exploitation,
et les éléments de calcul (105) fonctionnent
selon le second mode pendant un traitement de logiciel
de commande en synchronisme utilisé pour
établir ou conserver un fonctionnement aligné des
éléments de calcul (105).

2. Système informatique selon la revendication 1,
dans lequel les au moins deux éléments de calcul
comprennent chacun un élément de calcul
multiprocesseur.

3. Système informatique selon la revendication 2,
dans lequel les au moins deux éléments de calcul
comprennent chacun un élément de calcul multiprocesseur
symétrique (SMP).

4. Système informatique selon la revendication 2,
dans lequel chaque élément de calcul est réalisé en
utilisant une carte mère de norme industrielle.

5. Système informatique selon la revendication 1,
dans lequel le logiciel du système d'exploitation et
le logiciel d'application comprennent un logiciel non
modifié configuré pour être utilisé avec des systèmes
informatiques qui ne sont pas tolérants aux
anomalies.

6. Système informatique selon la revendication 2,
dans lequel le système est configuré pour désactiver
tous les processeurs sauf un de chaque élément
de calcul quand les éléments de calcul fonctionnent
selon le second mode.

7. Système informatique selon la revendication 1,
dans lequel chaque élément de calcul comprend un
processeur, une mémoire, et une connexion à l'unité
de commande.

8. Système informatique selon la revendication 7,
dans lequel chaque élément de calcul est configuré
afin que les opérations de rafraîchissement associées
à la mémoire soient synchronisées avec l'exécution
des opérations par le processeur.

9. The computer system according to claim 7, wherein the system is
configured to initiate direct memory access transfers to the memory when
the compute elements are operating in the second mode and to execute the
initiated direct memory access transfers when the compute elements are
operating in the first mode.

10. The computer system according to claim 7, wherein the system is
configured to synchronize compute elements by copying the contents of
the memory of a first compute element to the memory of a second compute
element; and resetting the processors of the first and second compute
elements without affecting the memories of the compute elements.

11. The computer system according to claim 1, wherein each compute
element is configured to transition from the first mode of operation to
the second mode of operation in response to an interrupt.

12. The computer system according to claim 11, wherein the interrupt
comprises a performance counter interrupt generated by the compute
element after the occurrence of a fixed number of clock cycles.

13. The computer system according to claim 12, wherein the interrupt
comprises a performance counter interrupt generated by the compute
element after the occurrence of a fixed number of processor clock
cycles.

14. The computer system according to claim 11, wherein the interrupt
comprises a performance counter interrupt generated by the compute
element after the occurrence of a fixed number of bus clock cycles.

15. The computer system according to claim 12, wherein the interrupt
comprises an interrupt generated by the compute element after the
execution of a fixed number of instructions.

16. The computer system according to claim 11, wherein the at least two
compute elements comprise a multiprocessor compute element having a
primary processor and one or more secondary processors, and wherein the
primary processor is configured to halt operation of the secondary
processors in response to the interrupt.

17. The computer system according to claim 1, wherein each compute
element is configured to generate an interrupt during a transition from
the second mode of operation to the first mode of operation, the
interrupt serving to align processing by the compute element with a
clock structure of the compute element.

18. The computer system according to claim 17, wherein the interrupt is
synchronized with the clock having the lowest frequency in the clock
structure.

19. The computer system according to claim 1, wherein the system is
configured to redirect I/O operations by the compute elements to the
controller.

20. The computer system according to claim 1, further comprising a
second controller connected to the first controller and to the at least
two compute elements.

21. The computer system according to claim 20, wherein the first
controller and a first of the compute elements are located at a first
location and the second controller and a second of the compute elements
are located at a second location, and further comprising a
communication link connecting the first controller to the second
controller, the first controller to the second of the compute elements,
and the second controller to the first of the compute elements.

22. The computer system according to claim 21, wherein the first
location is spaced from the second location by at least 5 meters.

23. The computer system according to claim 22, wherein the first
location is spaced from the second location by at least 100 meters.

24. A method of operating a fault tolerant/fault resilient computer
(100) having at least two compute elements (105) connected to at least
one controller and having clocks that operate asynchronously with
respect to the clocks of the other compute elements, the method being
characterized by: operating the compute elements (105) in a first mode
in which the compute elements each execute a first stream of
instructions in emulated clock lockstep, in which the compute elements
(105) perform the same sequence of instructions in the same order, each
instruction being performed in the same clock cycle by each compute
element, and operating the compute elements (105) in a second mode in
which the compute elements (105) each execute a second stream of
instructions in instruction lockstep, in which the compute elements
(105) perform the same sequence of instructions in the same order but
are not required to perform the instructions in the same clock cycle,
wherein the compute elements (105) operate in the first mode during
normal processing of application and operating system software and in
the second mode during processing of lockstep control software used in
establishing or maintaining aligned operation of the compute elements
(105).

25. The method according to claim 24, wherein each compute element
comprises a multiprocessor compute element.

26. The method according to claim 25, further comprising disabling all
but one of the processors of each compute element when the compute
elements are operating in the second mode.

27. The method according to claim 24, wherein each compute element
comprises a processor, a memory, and a connection to the controller.

28. The method according to claim 27, further comprising synchronizing
refresh operations associated with the memory with the execution of
operations by the processor.

29. The method according to claim 27, further comprising initiating
direct memory access transfers to the memory when the compute elements
are operating in the second mode and executing the initiated direct
memory access transfers when the compute elements are operating in the
first mode.

30. The method according to claim 27, further comprising synchronizing
the compute elements by: copying the contents of the memory of a first
compute element to the memory of a second compute element; and
resetting the processors of the first and second compute elements
without affecting the memories of the compute elements.

31. The method according to claim 24, further comprising transitioning
from operation in the first mode to operation in the second mode in
response to an interrupt.

32. The method according to claim 31, characterized in that the at
least two compute elements each comprise a multiprocessor compute
element having a primary processor and one or more secondary
processors, the method further comprising halting the secondary
processors in response to the interrupt.

33. The method according to claim 24, further comprising redirecting
I/O operations by the compute elements to the controller.
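
The following is an illustrative sketch only, not the implementation described in this document. It models, under simplifying assumptions, the instruction-lockstep idea of claim 24 and the resynchronization steps of claims 10 and 30: two compute elements with different clock rates execute the same instruction stream, are compared after a fixed instruction count, and are resynchronized by copying memory and resetting processor state without touching memory. The names `ComputeElement`, `run_quantum`, `aligned`, and `resynchronize`, as well as the toy instruction format, are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeElement:
    """Toy model of one compute element: a memory plus processor state."""
    cycles_per_instruction: int      # stands in for an independent, asynchronous clock
    memory: list = field(default_factory=lambda: [0] * 4)
    instructions_done: int = 0       # processor state: retired-instruction count
    cycles_elapsed: int = 0          # processor state: local clock cycles consumed

    def run_quantum(self, program, quantum):
        """Execute `quantum` instructions, as if stopped by an instruction-count interrupt."""
        for _ in range(quantum):
            addr, value = program[self.instructions_done % len(program)]
            self.memory[addr] += value
            self.instructions_done += 1
            self.cycles_elapsed += self.cycles_per_instruction

    def reset_processor(self):
        """Reset processor state only; the memory is deliberately left untouched."""
        self.instructions_done = 0
        self.cycles_elapsed = 0


def aligned(a, b):
    """Instruction-lockstep check: same instruction count and same memory image."""
    return a.instructions_done == b.instructions_done and a.memory == b.memory


def resynchronize(source, target):
    """Copy the source memory to the target, then reset both processors."""
    target.memory = list(source.memory)
    source.reset_processor()
    target.reset_processor()


# Hypothetical instruction stream: (memory address, value to add).
program = [(0, 1), (1, 2), (2, 3)]
ce1 = ComputeElement(cycles_per_instruction=2)
ce2 = ComputeElement(cycles_per_instruction=3)
for ce in (ce1, ce2):
    ce.run_quantum(program, quantum=6)

in_step = aligned(ce1, ce2)                            # same instructions, same memory
same_cycles = ce1.cycles_elapsed == ce2.cycles_elapsed  # clocks differ, so cycle counts differ

ce2.memory[0] = 99                                      # simulate a divergence in one element
resynchronize(ce1, ce2)                                 # memory copy, then processor reset
```

In this toy model the two elements agree on instruction count and memory contents even though their cycle counts differ, which is the distinction the claims draw between instruction lockstep and clock lockstep. Emulated clock lockstep itself is not modeled here.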